
2014 International Conference on Field-Programmable Technology (FPT): Latest Papers

Integrating FPGA-based processing elements into a runtime for parallel heterogeneous computing
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082807
David de la Chevallerie, Jens Korinth, A. Koch
In this work, we present an approach for integrating FPGA-based computing into a heterogeneous computing environment in an embedded-systems context, using the X10RT run-time of the X10 language system as a case study. To this end, we present a hardware/software framework for pools of reconfigurable compute elements, and show how high-level synthesis can be employed to generate the actual processing cores. Our framework is sufficiently lean to deliver high-performance FPGA implementations even at high area utilization (operating at 250 MHz with up to 90% of the device area used), and is capable of low-latency access to pools of dozens of instances of custom IP cores that are automatically generated by high-level synthesis tools.
Pages: 314-317
Citations: 4
Comparing performance, productivity and scalability of the TILT overlay processor to OpenCL HLS
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082748
Rafat Rashid, J. Steffan, Vaughn Betz
High-Level Synthesis (HLS) tools translate a software description of an application into custom FPGA logic, increasing designer productivity compared with Hardware Description Language (HDL) design flows. Overlays seek to improve productivity further by reducing application compile times and by raising abstraction, enabling the designer to target a software-programmable substrate instead of the underlying FPGA. We compare the performance, development effort and scalability of two C-to-FPGA approaches: our TILT overlay processor and Altera's OpenCL HLS. Our application-customized TILT implementations of five data-parallel benchmarks achieve from 41% to 80% of the throughput per unit of layout area of our best OpenCL HLS designs. The time required for initial hardware compilation of these TILT designs and configuration of the target application onto the overlay is roughly comparable to the compile times of the OpenCL HLS designs: 28 and 103 minutes on average, respectively. However, subsequent reconfigurations due to changes in the application that do not require re-synthesis of the overlay are fast, taking 38 seconds on average. In contrast, OpenCL HLS applications require full recompilation after every code change. TILT also enables smaller, more area-efficient designs than OpenCL HLS when low to moderate throughput is sufficient. For high throughput, the larger, spatially pipelined designs of OpenCL HLS are preferable.
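The productivity argument can be made concrete by turning the quoted averages into a rough cumulative-time estimate. The sketch below assumes, as a simplification that the abstract does not state, that every code change avoids re-synthesis of the overlay on TILT and triggers exactly one full recompilation on OpenCL HLS. The constant and function names are illustrative only.

```python
# Average figures taken from the abstract (all converted to seconds).
TILT_INITIAL_S = 28 * 60      # initial TILT hardware compile + configuration
TILT_RECONFIG_S = 38          # TILT reconfiguration without overlay re-synthesis
OPENCL_COMPILE_S = 103 * 60   # one full OpenCL HLS compilation


def tilt_total_seconds(code_changes: int) -> int:
    """Initial build plus one fast reconfiguration per code change (assumed)."""
    return TILT_INITIAL_S + code_changes * TILT_RECONFIG_S


def opencl_total_seconds(code_changes: int) -> int:
    """Initial build plus one full recompilation per code change."""
    return OPENCL_COMPILE_S * (1 + code_changes)


if __name__ == "__main__":
    for n in (0, 1, 10):
        print(n, tilt_total_seconds(n), opencl_total_seconds(n))
```

Under these assumptions the gap widens linearly with the number of edits, since each change costs 38 seconds on one side and 103 minutes on the other. Changes that do force overlay re-synthesis would narrow it.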
Pages: 20-27
Citations: 42
Hardware Trojan detection acceleration based on word-level statistical properties management
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082769
He Li, Qiang Liu
Hardware Trojan insertion has raised serious concerns in the semiconductor industry and among government agencies. A hardware Trojan is usually activated under rare conditions associated with low-transition bits in a circuit. The damage includes circuit functional failure or leakage of important information. Previous research on hardware Trojan detection is mainly based on side-channel analysis and Trojan activation. Long activation time is a major concern during the detection process. In this paper, we propose a novel approach for efficiently accelerating Trojan activation by increasing the transition activity of rare bits. In particular, the proposed approach increases the bit-level transition activity by controlling word-level statistical properties of the signal, such as its variance and autocorrelation. In addition, by analyzing how statistical properties propagate through various digital signal processing (DSP) operators such as adders and multipliers, the proposed approach can control the statistical properties of internal signals and thereby enhance the internal bit transition activity from the primary input of the circuit. The proposed approach is evaluated on several circuits. The results show that the transition activity of rare bits can be increased by up to 166.7 times and Trojan activation time can be reduced by up to 121 times.
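The link between a word-level property such as variance and the toggling of an individual bit can be seen in a toy example. The sketch below is not the paper's method. It only measures how often one bit flips in two hypothetical unsigned test sequences that share the same mean but differ in variance; the function names and the sequences are assumptions chosen for illustration.

```python
def transition_activity(samples, bit):
    """Fraction of consecutive sample pairs in which `bit` toggles."""
    if len(samples) < 2:
        return 0.0
    toggles = sum(
        ((a >> bit) ^ (b >> bit)) & 1
        for a, b in zip(samples, samples[1:])
    )
    return toggles / (len(samples) - 1)


def variance(samples):
    """Population variance of a sequence of numbers."""
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)


# Two hypothetical input sequences with the same mean (4).
low_variance = [3, 5] * 8    # small swings: bit 3 is never set
high_variance = [0, 8] * 8   # larger swings: bit 3 flips on every sample

if __name__ == "__main__":
    for name, seq in (("low", low_variance), ("high", high_variance)):
        print(name, variance(seq), transition_activity(seq, bit=3))
```

In this toy case, raising the variance from 1 to 16 takes bit 3 from never toggling to toggling on every sample, which is the kind of effect the abstract describes exploiting at the primary inputs.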
Pages: 153-160
Citations: 11
Efficient FPGA implementation of digit parallel online arithmetic operators
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082763
Kan Shi, D. Boland, G. Constantinides
Online arithmetic has been widely studied for ASIC implementation. Online components were originally designed to perform computations digit-serially with the most significant digit (MSD) first, resulting in the ability to chain arithmetic operators together for low latency. More recently, research has shown that digit-parallel online operators can fail more gracefully than operators with conventional arithmetic when operating beyond the deterministic clocking region. Unfortunately, the use of online arithmetic operators has in the past required a large area overhead for FPGA implementation. In this paper, we propose novel approaches to implement the key primitives of online arithmetic, adders and multipliers, efficiently on modern Xilinx FPGAs with 6-input LUTs and carry resources. We demonstrate experimentally that, in comparison to a direct RTL synthesis, the proposed architectures achieve slice savings of over 67% and 69%, and speed-ups of over 1.2x and 1.5x, for adders and multipliers respectively. As a result, the area overheads of using online adders and multipliers in place of traditional arithmetic primitives are reduced from 8.41x and 8.11x to 1.88x and 1.84x respectively. Finally, because an online multiplier generates MSDs first, we also demonstrate a method to create an online multiplier with a reduced-precision output that is smaller than a traditional multiplier producing the same result. We show that this can lead to silicon area savings of up to 56%.
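The value of producing the most significant digit first can be illustrated with the number representation alone. The sketch below does not model the adder or multiplier architectures from the paper. It assumes a radix-2 signed-digit fraction with digits in {-1, 0, 1}, a representation commonly used in online arithmetic, and shows that stopping after k digits leaves an error below 2^-k. The operand and the function name are hypothetical.

```python
from fractions import Fraction


def msd_first_prefix_values(digits):
    """Yield the running value of a radix-2 signed-digit fraction.

    `digits` holds d_1, d_2, ... with each d_i in {-1, 0, 1}; the number
    represented is sum(d_i * 2**-i), consumed most significant digit first.
    """
    value = Fraction(0)
    for i, d in enumerate(digits, start=1):
        if d not in (-1, 0, 1):
            raise ValueError("signed digits must be -1, 0 or 1")
        value += Fraction(d, 2 ** i)
        yield value


digits = [1, 0, -1, 1, 1, -1, 0, 1]   # hypothetical 8-digit operand
prefixes = list(msd_first_prefix_values(digits))
exact = prefixes[-1]

if __name__ == "__main__":
    for k, approx in enumerate(prefixes, start=1):
        print(k, approx, abs(exact - approx))
```

Because every prefix is already a bounded-error approximation of the final value, a result can be truncated to fewer digits without computing the discarded ones, which is consistent with the reduced-precision multiplier mentioned in the abstract.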
Pages: 115-122
Citations: 11
Hardware architecture of bi-cubic convolution interpolation for real-time image scaling
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082790
Gopinath Mahale, H. Mahale, Rajesh Babu Parimi, S. Nandy, S. Bhattacharya
This paper presents two hardware architectures for bi-cubic convolution interpolation, termed Parallelized Row Column Interpolation Architecture (PRCIA) and Serialized Row Column Interpolation Architecture (SRCIA), for real-time image scaling. These architectures address the challenges of high computational complexity, redundant computations and repeated memory accesses, which were not explicitly addressed in existing architectures. In addition, the proposed architectures employ parallel computations to improve the throughput for real-time applications. The proposed architectures have been emulated and tested on a Virtex-6 FPGA. The emulated PRCIA and SRCIA are able to scale input grayscale images of dimensions up to 640 × 480 at 59 and 48 frames per second respectively, with arbitrary scaling factors up to 4 in both dimensions.
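For readers unfamiliar with the underlying computation, bi-cubic convolution interpolation is separable: each output pixel can be obtained by interpolating along four neighbouring rows and then once along the resulting column. The software sketch below shows that row-then-column structure only. It is not the PRCIA or SRCIA hardware design. It assumes the widely used Keys kernel with a = -0.5 and clamped image borders, neither of which is stated in the abstract.

```python
import math


def cubic_kernel(x, a=-0.5):
    """Keys cubic convolution kernel; a = -0.5 is an assumed, common choice."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0


def interp_1d(samples, pos):
    """Interpolate a 1-D sequence at fractional position `pos` (edges clamped)."""
    base = math.floor(pos)
    total = 0.0
    for k in range(base - 1, base + 3):          # the four nearest samples
        idx = min(max(k, 0), len(samples) - 1)   # clamp at the borders
        total += samples[idx] * cubic_kernel(pos - k)
    return total


def interp_2d(image, y, x):
    """Row pass over four neighbouring rows, then a single column pass."""
    base = math.floor(y)
    total = 0.0
    for r in range(base - 1, base + 3):
        idx = min(max(r, 0), len(image) - 1)
        total += interp_1d(image[idx], x) * cubic_kernel(y - r)
    return total


if __name__ == "__main__":
    ramp = [[4 * r + c for c in range(4)] for r in range(4)]
    print(interp_2d(ramp, 1.5, 1.5))
```

Each output pixel touches a 4 × 4 neighbourhood, and adjacent output pixels share most of it, which is where the redundant computations and repeated memory accesses mentioned in the abstract come from.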
Pages: 264-267
Citations: 5
A novel three-dimensional FPGA architecture with high-speed serial communication links
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082805
T. Kajiwara, Qian Zhao, M. Amagasaki, M. Iida, Morituro Kuga, T. Sueyoshi
Three-dimensional (3D) integrated circuit technology is expected to offer continued improvement in very-large-scale integration performance as miniaturization approaches its physical limits. However, because the through-silicon vias (TSVs) that are used to create interlayer vertical connections are much larger in area than transistors, there is an inherent tradeoff between connectivity and small size. Field-programmable gate arrays (FPGAs) are particularly noted for requiring a high level of routing resources, which means that it is unrealistic to make the same number of connections vertically as horizontally. In previous research, we proposed a method for creating a two-layer compact 3D FPGA with face-down integration (the base FPGA). In this paper, we discuss stacking multiple base FPGAs by the face-up method and propose a method for achieving high-speed interlayer communications with TSV serial connections. The proposed architecture improves FPGA performance by using fewer TSVs. The evaluation results show that the proposed 3D FPGA can achieve a total area that is as low as 67% of that of the equivalent two-dimensional FPGA.
Pages: 306-309
Citations: 3
A high-performance low-power near-Vt RRAM-based FPGA
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082777
Xifan Tang, P. Gaillardon, G. Micheli
The routing architecture, which relies heavily on programmable switches, dominates the area, delay and power of Field Programmable Gate Arrays (FPGAs). Resistive Random Access Memories (RRAMs) enable high-performance routing architectures through the replacement of Static Random Access Memory (SRAM)-based programming switches. Exploiting the very low on-resistance state achievable by RRAMs, RRAM-based routing multiplexers can be used to significantly reduce FPGA routing delays. In addition, RRAM-based routing architectures are less sensitive to supply voltage reductions and show promise in low-power FPGA designs. In this paper, we propose a near-Vt low-power RRAM-based FPGA in which both delay and power reductions are achieved. Experimental results demonstrate that a near-Vt RRAM-based FPGA design leads to a 15% area shrink, a 10% delay reduction, and a 65% power improvement, compared to a conventional FPGA design for a given technology node. To achieve low on-resistance values, RRAMs typically require high programming currents. In other words, they need relatively large programming transistors, potentially resulting in area, delay and power inefficiencies. We also present a design methodology to properly size the programming transistors of RRAMs in order to further improve the area efficiency. Experimental results show that a correct programming transistor sizing strategy contributes a further 18% area and 2% delay shrink, compared to the initial near-Vt RRAM-based FPGA.
Pages: 207-214
Citations: 48
Real-time 3D reconstruction for FPGAs: A case study for evaluating the performance, area, and programmability trade-offs of the Altera OpenCL SDK
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082810
Q. Gautier, A. Shearer, J. Matai, D. Richmond, Pingfan Meng, R. Kastner
Embedding real-time 3D reconstruction of a scene from a low-cost depth sensor can improve the development of technologies in the domains of augmented reality, mobile robotics, and more. However, current implementations require a computer with a powerful GPU, which limits their prospective use in applications with low-power requirements. To implement low-power 3D reconstruction, we embedded two prominent 3D reconstruction algorithms (Iterative Closest Point and Volumetric Integration) on an Altera Stratix V FPGA by using the OpenCL language and the Altera OpenCL SDK. In this paper, we present our application and an evaluation of the Altera tool in terms of performance, area, and programmability trade-offs. We have verified that OpenCL can be a viable method for developing FPGA applications by modifying an open-source version of the Microsoft KinectFusion project to run partially on an FPGA.
Pages: 326-329
Citations: 15
Deep and narrow binary content-addressable memories using FPGA-based BRAMs
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082808
Ameer Abdelhadi, G. Lemieux
Binary Content Addressable Memories (BCAMs) are massively parallel search engines capable of searching the entire memory space in a single clock cycle. BCAMs are used in a wide range of applications, such as memory management, networks, data compression, DSP, and databases. Due to the increasing amount of processed information, modern BCAM applications demand a deep search space. However, traditional BCAM approaches in FPGAs suffer from storage inefficiency. In this paper, a novel and efficient technique for constructing deep and narrow BCAMs out of standard SRAM blocks in FPGAs is proposed. This technique is most efficient for deep and narrow CAMs since the BRAM consumption is exponential in the pattern width. Using Altera's Stratix V device, traditional methods achieve up to a 64K-entry BCAM while the proposed technique achieves up to 4M entries. For the 64K-entry test case, traditional methods consume 43 times more ALMs and achieve only one-third of the Fmax. A fully parameterized Verilog implementation is available. This implementation has been extensively tested using Altera's tools.
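The remark that BRAM consumption is exponential in the pattern width can be illustrated with one well-known way of building a CAM from ordinary RAM: index the RAM by pattern and store, for each pattern, a bit-vector marking which CAM addresses hold it. The sketch below is a behavioural toy of that general idea and is not claimed to be the construction proposed in the paper. The class name, method names and the `stored` bookkeeping list are assumptions made for illustration.

```python
class TransposedRamCam:
    """Toy binary CAM: a RAM indexed by pattern, storing match-line bit-vectors."""

    def __init__(self, depth, pattern_width):
        self.depth = depth
        self.pattern_width = pattern_width
        # One row per possible pattern; bit i of a row is set when
        # CAM address i currently stores that pattern.
        self.rows = [0] * (1 << pattern_width)
        # Bookkeeping so a rewrite can clear the old indicator bit
        # (hardware would need extra storage or logic for this).
        self.stored = [None] * depth

    def write(self, address, pattern):
        old = self.stored[address]
        if old is not None:
            self.rows[old] &= ~(1 << address)   # clear the stale indicator
        self.rows[pattern] |= 1 << address
        self.stored[address] = pattern

    def match(self, pattern):
        """Return all CAM addresses holding `pattern` (a single RAM read)."""
        row = self.rows[pattern]
        return [i for i in range(self.depth) if (row >> i) & 1]

    def storage_bits(self):
        """RAM bits needed: 2**pattern_width rows of `depth` bits each."""
        return (1 << self.pattern_width) * self.depth


if __name__ == "__main__":
    cam = TransposedRamCam(depth=8, pattern_width=4)
    cam.write(2, 0b1010)
    cam.write(5, 0b1010)
    print(cam.match(0b1010), cam.storage_bits())
```

In this toy model each extra pattern bit doubles the RAM required while extra depth only widens the rows linearly, which is why such schemes suit narrow patterns and why depth is the dimension worth optimizing.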
Citations: 9
Memory security in reconfigurable computers: Combining formal verification with monitoring
Pub Date : 2014-12-01 DOI: 10.1109/FPT.2014.7082771
T. Wiersema, Stephanie Drzevitzky, M. Platzner
Ensuring memory access security is a challenge for reconfigurable systems with multiple cores. Previous work introduced access monitors attached to the memory subsystem to ensure that the cores adhere to pre-defined protocols when accessing memory. In this paper, we combine access monitors with a formal runtime verification technique known as proof-carrying hardware to guarantee memory security. We extend previous work on proof-carrying hardware by covering sequential circuits and demonstrate our approach with a prototype leveraging ReconOS/Zynq with an embedded ZUMA virtual FPGA overlay. Experiments show the feasibility of the approach and the capabilities of the prototype, which constitutes the first realization of proof-carrying hardware on real FPGAs. The area overheads for the virtual FPGA are measured as 2x-10x, depending on the resource type. The delay overhead is substantial, at almost 100x, but this is an extremely pessimistic estimate that will be lowered once accurate timing analysis for FPGA overlays becomes available. Finally, reconfiguration time for the virtual FPGA is about one order of magnitude lower than for the native Zynq fabric.
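The idea of an access monitor sitting between cores and memory can be sketched in software. This is only a rough analogy under stated assumptions: the abstract does not describe the monitored protocols, so the sketch assumes a simple per-core address-range policy with read and write permissions. The names `Region` and `AccessMonitor` are hypothetical and are not taken from the paper, ReconOS, or ZUMA.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Region:
    """A memory window one core may access (hypothetical policy format)."""
    base: int
    size: int
    writable: bool

    def contains(self, addr: int, length: int) -> bool:
        # The whole access must fall inside the window.
        return self.base <= addr and addr + length <= self.base + self.size


class AccessMonitor:
    """Checks each memory request against a per-core policy and logs
    violations. A software stand-in for a hardware monitor."""

    def __init__(self, policy: Dict[str, List[Region]]) -> None:
        self.policy = policy
        self.violations: List[Tuple[str, int, int, bool]] = []

    def check(self, core: str, addr: int, length: int, is_write: bool) -> bool:
        for region in self.policy.get(core, []):
            if region.contains(addr, length) and (region.writable or not is_write):
                return True
        # No region permits this access: record it and reject.
        self.violations.append((core, addr, length, is_write))
        return False


monitor = AccessMonitor({
    "core0": [Region(base=0x1000, size=0x100, writable=True)],
    "core1": [Region(base=0x1000, size=0x100, writable=False)],
})
print(monitor.check("core0", 0x1000, 4, True))   # permitted write
print(monitor.check("core1", 0x1010, 4, True))   # rejected: read-only window
```

A runtime check of this kind catches violations as they happen. The paper pairs such monitoring with proof-carrying hardware, which this sketch does not attempt to model.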
Citations: 13