
2013 International Conference on Field-Programmable Technology (FPT): Latest Publications

Hardware acceleration of biomedical models with OpenCMISS and CellML
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718390
Ting-Rong Yu, C. Bradley, O. Sinnen
OpenCMISS is a mathematical modeling environment designed to solve field-based equations and link subcellular and tissue-level biophysical processes to organ-level processes. It employs a general-purpose parallel design, in particular distributed memory, for its computations. CellML is an XML-based markup language designed to encode lumped-parameter, biophysically based systems of ordinary differential equations and nonlinear algebraic equations. OpenCMISS allows CellML models to be evaluated and integrated into models at various spatial and temporal scales. Because these computations have good inherent parallelism, FPGA-based hardware acceleration has great potential to increase computational performance and reduce the energy consumption of computations with CellML models integrated in OpenCMISS. However, with several hundred CellML models, manual hardware implementation of each model is complex and time-consuming. The advantages of FPGA designs will only be realised if there is a general solution or tool that automatically converts CellML models into hardware description languages such as VHDL. In this paper we describe the architecture for the FPGA hardware implementation of CellML models and evaluate the first results related to performance and resource usage based on a variety of criteria.
Citations: 3
Derivation of efficient FSM from loop nests
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718367
Tomofumi Yuki, Antoine Morvan, Steven Derrien
Pipelined execution is one of the most important optimizations in hardware design for improving hardware utilization, and hence throughput. Loop pipelining is a transformation available in High-Level Synthesis (HLS) tools that executes multiple iterations of a loop in a pipeline. Nested loop pipelining is a related technique that improves hardware utilization when the iteration count of the innermost loop is small. However, it is also known to increase the complexity of the control logic, and hence to degrade the achievable clock frequency. In this paper, we present an automatic transformation targeting HLS that improves the effectiveness of nested loop pipelining through efficient implementations of the control path. Specifically, we present (i) an analytical model that captures the trade-off between gain in cycles and loss in frequency, (ii) automatic derivation of efficient Finite State Machines (FSMs) from loop nests, and (iii) an efficient implementation of the derived FSM that improves the performance of the synthesized hardware.
Citations: 8
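The trade-off between gain in cycles and loss in frequency mentioned in the abstract of "Derivation of efficient FSM from loop nests" can be illustrated with a back-of-the-envelope calculation. The sketch below uses the textbook pipeline latency formula, not the paper's analytical model; the function names, loop bounds, pipeline depth, and clock frequencies are all hypothetical values chosen for illustration.

```python
def cycles_inner_pipelined(outer, inner, depth, ii=1):
    """Cycles when only the innermost loop is pipelined.

    The pipeline fills and drains once per outer iteration, so the
    fill/drain cost (depth) is paid `outer` times.
    """
    return outer * ((inner - 1) * ii + depth)


def cycles_nest_pipelined(outer, inner, depth, ii=1):
    """Cycles when the whole nest is pipelined as one flattened loop.

    The fill/drain cost is paid only once.
    """
    return (outer * inner - 1) * ii + depth


def exec_time_us(cycles, freq_mhz):
    """Execution time in microseconds at the given clock frequency."""
    return cycles / freq_mhz


if __name__ == "__main__":
    # Hypothetical nest: 1000 outer iterations, 4 inner iterations,
    # pipeline depth 10, initiation interval 1.
    outer, inner, depth = 1000, 4, 10
    # Assumed frequencies: the more complex control costs 10% of the clock.
    t_inner = exec_time_us(cycles_inner_pipelined(outer, inner, depth), 200.0)
    t_nest = exec_time_us(cycles_nest_pipelined(outer, inner, depth), 180.0)
    print(f"inner-loop pipelining:  {t_inner:.2f} us")
    print(f"nested-loop pipelining: {t_nest:.2f} us")
```

With a short inner loop, the cycle savings dominate even if the clock slows somewhat; as the inner trip count grows relative to the pipeline depth, the gain shrinks and a frequency loss can outweigh it.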
Flexible hierarchy ray tracing on FPGAs
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718379
Sam Collinson, O. Sinnen
Rendering programs use ray tracing to artificially create photo-realistic scenes that would otherwise be too dangerous, too costly or physically impossible to fabricate. The rendering process can be accelerated through spatial or object hierarchy structures, which aim to restrict the number of expensive ray-object intersection calculations along a ray path by trading them for traversal of the structure. With extensive inherent parallelism, ray tracing benefits from GPU acceleration but may also benefit from the more flexible control flow and memory architecture available with FPGAs. We present a flexible FPGA-based ray tracing platform capable of traversing acceleration hierarchies of varying widths and types to evaluate their efficiency. The platform consists of four main controllers for communication, traversal, intersection and memory. The platform interfaces with LuxRays, an open-source C++ renderer, over PCI Express to transfer data for computation to onboard memory. We implement a configuration of the platform at 250 MHz on our target device that shows promising results compared to CPU and GPU renders.
Citations: 0
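The idea of trading ray-object intersection tests for traversal of a hierarchy, as described in the abstract of "Flexible hierarchy ray tracing on FPGAs", can be sketched in software. The following is a minimal illustration, assuming axis-aligned bounding boxes and the standard slab test; it is not the paper's hardware design, and the `Node` type and function names are hypothetical. A node may have any number of children, which corresponds to a hierarchy of arbitrary width.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: tuple                                        # minimum corner of the bounding box
    hi: tuple                                        # maximum corner of the bounding box
    children: list = field(default_factory=list)     # any branching factor
    primitives: list = field(default_factory=list)   # non-empty only at leaves


def hits_box(origin, direction, lo, hi):
    """Slab test: does the ray origin + t*direction (t >= 0) hit the box?"""
    t_near, t_far = 0.0, float("inf")
    for o, d, l, h in zip(origin, direction, lo, hi):
        if d == 0.0:
            # Ray is parallel to this slab: it must start inside it.
            if o < l or o > h:
                return False
            continue
        t0, t1 = (l - o) / d, (h - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True


def candidates(root, origin, direction):
    """Collect primitives whose enclosing boxes the ray passes through.

    Only these primitives need the expensive exact intersection test.
    """
    stack, found = [root], []
    while stack:
        node = stack.pop()
        if not hits_box(origin, direction, node.lo, node.hi):
            continue  # prune the whole subtree
        found.extend(node.primitives)
        stack.extend(node.children)
    return found


if __name__ == "__main__":
    leaf_a = Node((0, 0, 0), (1, 1, 1), primitives=["a"])
    leaf_b = Node((5, 5, 5), (6, 6, 6), primitives=["b"])
    root = Node((0, 0, 0), (6, 6, 6), children=[leaf_a, leaf_b])
    print(candidates(root, (-1.0, 0.5, 0.5), (1.0, 0.0, 0.0)))
```

In this usage example the ray travels along the x axis through the first leaf only, so the second leaf's primitive is never tested.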
Teaching FPGA security
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718373
L. Bossuet
Teaching FPGA security to electrical engineering students is new at graduate level. It requires a wide field of knowledge and a lot of time. This paper describes a compact course on FPGA security that is available to electrical engineering master's students at the Saint-Etienne Institute of Telecom, University of Lyon, France. It is intended for instructors who wish to design a new course on this topic. The paper reviews the motivation for the course, the pedagogical issues involved, the curriculum, the lab materials and tools used, and the results. Details are provided on two original lab sessions, in particular a compact lab that requires students to perform differential power analysis of an FPGA implementation of the AES symmetric cipher.
Citations: 3
A 66.1 Gbps single-pipeline AES on FPGA
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718392
Qiang Liu, Zhenyu Xu, Ye Yuan
Targeting real-time encryption/decryption of high-speed data communication, this paper proposes an FPGA-based high-throughput AES design. The critical functions involved in AES are broken into elementary logic operations to gain deep insight into the performance bottleneck. With respect to FPGA structures, a datapath with two balanced pipeline stages is determined for each of the encryption/decryption rounds. Meanwhile, a new key expansion scheme with additional nonlinear operations is proposed to increase the security of the AES implementation; it is well matched to the two-stage pipelined datapath. The design is evaluated on various FPGA devices and compared with several existing AES implementations. Results show that, in terms of both throughput and throughput per slice, the proposed single-pipeline AES design outperforms most existing designs and achieves a throughput of 66.1 Gbps on a recent FPGA device.
Citations: 28
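To put the 66.1 Gbps figure from "A 66.1 Gbps single-pipeline AES on FPGA" in context, the sketch below relates throughput to clock frequency. It assumes a fully pipelined design that accepts one 128-bit AES block per clock cycle; the abstract does not state the clock frequency or the issue rate, so the implied frequency is only an estimate under that assumption. The function names are hypothetical.

```python
AES_BLOCK_BITS = 128  # AES block size is fixed at 128 bits


def throughput_gbps(freq_mhz, blocks_per_cycle=1.0):
    """Throughput of a pipeline that accepts `blocks_per_cycle` blocks per clock."""
    return AES_BLOCK_BITS * blocks_per_cycle * freq_mhz * 1e6 / 1e9


def implied_clock_mhz(gbps, blocks_per_cycle=1.0):
    """Clock frequency needed to sustain `gbps` at the given issue rate."""
    return gbps * 1e9 / (AES_BLOCK_BITS * blocks_per_cycle) / 1e6


if __name__ == "__main__":
    # Assumption: one block per cycle (not stated in the abstract).
    print(f"implied clock: {implied_clock_mhz(66.1):.1f} MHz")
    print(f"throughput at 500 MHz: {throughput_gbps(500.0):.1f} Gbps")
```

Under the one-block-per-cycle assumption, 66.1 Gbps corresponds to a clock of roughly 516 MHz, which is consistent with the abstract's emphasis on short, balanced pipeline stages.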
Multi-personality partitioning for heterogeneous systems
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718375
A. Gregerson, Aman Chadha, Katherine Morrow
Design flows use graph partitioning both as a precursor to place and route for single devices, and to divide netlists or task graphs among multiple devices. Partitioners have accommodated FPGA heterogeneity via multi-resource constraints, but have not yet exploited the corresponding ability to implement some computations in multiple ways (e.g., LUTs vs. DSP blocks), which could enable a superior solution. This paper introduces multi-personality graph partitioning, which incorporates aspects of resource mapping into partitioning. We present a modified multi-level KLFM partitioning algorithm that also performs heterogeneous resource mapping for nodes with multiple potential implementations (multiple personalities). We evaluate several variants of our multi-personality FPGA circuit partitioner using 21 circuits and benchmark graphs, and show that dynamic resource mapping improves cut size on average by 27% over static mapping for these circuits. We further show that it improves deviation from target resource utilizations by 50% over post-partitioning resource mapping.
Citations: 0
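The two quantities discussed in the abstract of "Multi-personality partitioning for heterogeneous systems", cut size and per-partition resource utilization, can be made concrete with a small example. This is a minimal sketch of how they might be evaluated for a given assignment, not the modified KLFM algorithm itself; the data layout, node names, and resource figures are hypothetical.

```python
def cut_size(edges, part):
    """Number of edges whose endpoints lie in different partitions."""
    return sum(1 for u, v in edges if part[u] != part[v])


def resource_usage(part, choice, personalities):
    """Total resources used in each partition for the chosen personalities.

    personalities[node][name] maps a resource type to the amount required
    when `node` is implemented with personality `name`.
    """
    totals = {}
    for node, p in part.items():
        bucket = totals.setdefault(p, {})
        for res, amount in personalities[node][choice[node]].items():
            bucket[res] = bucket.get(res, 0) + amount
    return totals


if __name__ == "__main__":
    edges = [("mul", "add"), ("add", "reg"), ("mul", "reg")]
    part = {"mul": 0, "add": 0, "reg": 1}
    # Hypothetical costs: the multiplier can be built from LUTs or one DSP block.
    personalities = {
        "mul": {"lut": {"LUT": 120}, "dsp": {"DSP": 1}},
        "add": {"lut": {"LUT": 40}},
        "reg": {"lut": {"LUT": 10}},
    }
    choice = {"mul": "dsp", "add": "lut", "reg": "lut"}
    print(cut_size(edges, part))
    print(resource_usage(part, choice, personalities))
```

Switching the multiplier's personality changes the resource totals of its partition without changing the cut, which is why mapping and partitioning can usefully be decided together.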
The architecture and placement algorithm for a uni-directional routing based 3D FPGA
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718325
Junsong Hou, Heng Yu, Yajun Ha, Xin Liu
Three-dimensional (3D) FPGAs, a promising design trend, achieve significant performance improvements over conventional 2D FPGAs. The maturity of the uni-directional routing architecture design, which achieves a 25% saving in area-delay product (ADP) over bi-directional routing architectures, has driven major vendors such as Xilinx and Altera to switch to this architecture in their 2D products. However, few studies have explored performance-optimal uni-directional 3D routing architectures. In this paper, we propose and evaluate a novel uni-directional 3D routing architecture named UNI-3D. On the EDA side, we also propose an improved simulated annealing (SA)-based placement algorithm that caters to the uni-directional architecture, alleviating the signal propagation imbalance in the vertical channels that results from using the conventional bi-directional SA approach. Our simulation results show that the proposed architecture achieves up to 28.44% delay reduction and 26.21% planar channel width reduction compared with the baseline 2D uni-directional architecture. At the same time, the proposed SA algorithm improves the average vertical channel width by up to 16% compared to state-of-the-art works.
Citations: 0
Efficient methods for out-of-order load/store execution for high-performance soft processors
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718409
Henry Wong, Vaughn Betz, Jonathan Rose
As FPGAs continue to increase in size, it becomes increasingly feasible and desirable to build higher-performance soft processors. The familiar single-threaded programming model can be preserved with an out-of-order processor. The ability to execute memory loads and stores out of order has a large impact on performance, but this is difficult to do because the dependencies between stores and loads are not known until addresses are computed. Out-of-order memory disambiguation is traditionally done with content-addressable memories (CAMs) in the load queue and store queue, but large CAMs are inefficient on FPGAs. Store Queue Index Prediction (SQIP) and NoSQ propose to replace CAMs with store-load forwarding prediction and load re-execution. We implement four memory disambiguation schemes (in-order, CAM, SQIP, NoSQ) on a Stratix IV FPGA and evaluate the area and delay trade-offs. We find that CAM area and delay degrade quickly with load/store queue size, while SQIP and NoSQ degrade little with queue size but incur area overhead for prediction and predictor training hardware. SQIP and NoSQ use less area than CAMs beyond 32 and 16 load/store queue entries, respectively, and have a higher maximum frequency beyond 4 entries.
Citations: 10
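The CAM-based disambiguation that the abstract of "Efficient methods for out-of-order load/store execution for high-performance soft processors" uses as its baseline can be modelled in a few lines. The sketch below is a software model under simplifying assumptions (whole-word accesses, a conservative stall whenever an older store address is still unknown); it is not the paper's hardware, and the `StoreEntry` type and `lookup` function are hypothetical. The loop over every store-queue entry corresponds to the parallel address comparison that makes large CAMs costly on FPGAs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoreEntry:
    seq: int              # program-order sequence number
    addr: Optional[int]   # None until the store address has been computed
    data: int


def lookup(store_queue, load_seq, load_addr):
    """Decide where a load gets its value.

    Returns ("forward", data) if the youngest older store to the same
    address can supply the value, ("stall", None) if any older store
    address is still unknown, and ("memory", None) otherwise.
    """
    older = [st for st in store_queue if st.seq < load_seq]
    if any(st.addr is None for st in older):
        return ("stall", None)  # conservative: dependence cannot be ruled out
    matches = [st for st in older if st.addr == load_addr]
    if matches:
        youngest = max(matches, key=lambda st: st.seq)
        return ("forward", youngest.data)
    return ("memory", None)


if __name__ == "__main__":
    sq = [
        StoreEntry(1, 0x10, 11),
        StoreEntry(2, 0x20, 22),
        StoreEntry(3, 0x10, 33),
        StoreEntry(5, 0x10, 55),  # younger than the load below, so ignored
    ]
    print(lookup(sq, load_seq=4, load_addr=0x10))
    print(lookup(sq, load_seq=4, load_addr=0x30))
```

Prediction-based schemes such as SQIP and NoSQ avoid this full search by guessing which store, if any, a load depends on, and re-executing the load to verify the guess.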
Datapath fault tolerance for parallel accelerators
Pub Date : 1900-01-01 DOI: 10.1109/FPT.2013.6718389
James J. Davis, P. Cheung
While we reap the benefits of process scaling in terms of transistor density and switching speed, consideration must be given to the negative effects it causes: increased variation, degradation and fault susceptibility. Above device level, such phenomena and the faults they induce can lead to reduced yield, decreased system reliability and, in extreme cases, total failure after a period of successful operation. Although error detection and correction are almost always considered for highly sensitive and susceptible applications such as those in space, for other, more general-purpose applications they are often overlooked. In this paper, we present a parallel matrix multiplication accelerator running in hardware on the Xilinx Zynq system-on-chip platform, along with 'bolt-on' logic for detecting, locating and avoiding faults within its datapath. Designs of various sizes are compared with respect to resource overhead and performance impact. Our largest-implemented fault-tolerant accelerator was found to consume 17.3% more area, run at a 3.95% lower frequency and incur an 18.8% execution time penalty over its equivalent fault-susceptible design during fault-free operation.
Citations: 4