
2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines: Latest Publications

Reducing Overheads for Fault-Tolerant Datapaths with Dynamic Partial Reconfiguration
James J. Davis, P. Cheung
As process scaling and transistor count inflation continue, silicon chips are becoming increasingly susceptible to faults. Although FPGAs are particularly vulnerable to these effects, their runtime reconfigurability offers unique opportunities for fault tolerance. This work presents an application combining algorithmic-level error detection with dynamic partial reconfiguration (DPR) to allow faults manifested within its datapath at runtime to be circumvented at low cost.
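The following minimal Python sketch illustrates the general pattern the abstract describes: an algorithm-level check (here an ABFT-style checksum, an assumption rather than the authors' detector) guards the datapath output, and a detected mismatch triggers reconfiguration of the faulty region with a spare partial bitstream. The `datapath`, `reconfigure_region`, and bitstream objects are hypothetical placeholders, not the paper's implementation.

```python
# Hypothetical host-side sketch: algorithm-level error detection guarding a
# hardware datapath, with dynamic partial reconfiguration (DPR) as the repair
# action. The datapath interface and reconfiguration call are placeholders.

def checksum_ok(inputs, outputs):
    """Illustrative check for a pass-through datapath: the output total must
    equal the input total (tolerance covers rounding)."""
    return abs(sum(inputs) - sum(outputs)) < 1e-6

def run_with_dpr(datapath, reconfigure_region, batches, spare_bitstreams):
    """Run batches through the datapath; on a detected error, reload the
    faulty region with the next spare partial bitstream and retry."""
    spare = iter(spare_bitstreams)
    for batch in batches:
        result = datapath(batch)
        while not checksum_ok(batch, result):
            reconfigure_region(next(spare))   # swap in a spare datapath variant
            result = datapath(batch)          # re-run the failed batch
        yield result

# Tiny demonstration with a fake datapath that starts faulty and a
# reconfiguration call that "repairs" it (both placeholders).
state = {"faulty": True}
def fake_datapath(batch):
    return [0.0] * len(batch) if state["faulty"] else list(batch)
def fake_reconfigure(bitstream):
    state["faulty"] = False

print(list(run_with_dpr(fake_datapath, fake_reconfigure,
                        batches=[[1.0, 2.0], [3.0]],
                        spare_bitstreams=["variant_a.bit"])))
```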
DOI: 10.1109/FCCM.2014.36 (published 2014-05-11)
Citations: 1
System-Level Retiming and Pipelining
Girish Venkataramani, Y. Gu
In this paper, we examine retiming and pipelining in the context of system-level optimization techniques. Our main contributions are: (a) functionally equivalent retiming and delay balancing as necessary techniques for pipelining and retiming system-level graphs while maintaining numerical fidelity, and (b) clock-rate pipelining, as a new technique that leverages the knowledge of multi-rate design spec to pipeline multi-cycle paths. All these techniques have been implemented within HDL Coder™, a tool that generates synthesizable HDL code from Simulink ® and MATLAB®.
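As a concrete illustration of delay balancing, the hedged sketch below works on a small dataflow DAG: after pipeline registers are inserted on some edges, every path converging on a node must carry the same number of delays, so the shorter paths receive balancing delays. This is a generic reconstruction of the idea, not HDL Coder's implementation.

```python
# Minimal sketch of delay balancing on a dataflow DAG (not HDL Coder's
# algorithm): after pipelining, every path converging on a node must carry
# the same number of register delays, so shorter paths get balancing delays.

def balance_delays(nodes, edges, delay):
    """nodes: topologically ordered ids; edges: (src, dst) pairs;
    delay[(src, dst)]: pipeline registers already on that edge.
    Returns extra delays to add per edge so all inputs of a node align."""
    arrival = {n: 0 for n in nodes}          # registers accumulated from inputs
    extra = {}
    for n in nodes:
        preds = [(s, d) for (s, d) in edges if d == n]
        if not preds:
            continue
        latest = max(arrival[s] + delay[(s, n)] for s, _ in preds)
        for s, _ in preds:
            extra[(s, n)] = latest - (arrival[s] + delay[(s, n)])
        arrival[n] = latest
    return extra

# Example: a multiply (1 pipeline stage) joins an un-pipelined bypass at an adder,
# so the bypass edge needs one balancing delay.
nodes = ["in", "mul", "add"]
edges = [("in", "mul"), ("mul", "add"), ("in", "add")]
delay = {("in", "mul"): 0, ("mul", "add"): 1, ("in", "add"): 0}
print(balance_delays(nodes, edges, delay))   # ("in", "add") gets 1 extra delay
```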
DOI: 10.1109/FCCM.2014.30 (published 2014-05-11)
Citations: 11
Rapid Post-Map Insertion of Embedded Logic Analyzers for Xilinx FPGAs
B. Hutchings, J. Keeley
A rapid post-map insertion technique for an embedded logic analyzer is discussed. The proposed technique makes use of otherwise unused resources in an already-mapped circuit and does not disturb the original placement and routing of the circuit. Using this technique, designers can add debugging circuitry to existing circuits and quickly modify the set of observed signals in just a few minutes instead of waiting for a recompile of their circuit. All tests were performed on a Xilinx Virtex-5 FPGA.
DOI: 10.1109/FCCM.2014.29 (published 2014-05-11)
Citations: 29
FPGA Architecture Enhancements to Support Heterogeneous Partially Reconfigurable Regions
Christophe Huriaux, O. Sentieys, R. Tessier
In this work the authors develop an FPGA architecture which allows for the placement of a partial FPGA design on the logic fabric even if the relative placement of heterogeneous blocks within the target region is not identical to the placement used to generate the bitstream for the partial design. This work has been conducted in the context of the European FP7 FlexTiles project, in which a dynamically reconfigurable logic fabric is embedded in a 3-D stacked chip along with a manycore architecture. The reconfigurable logic fabric is used to load hardware-accelerated functions whose use is scheduled at run time. All communication between the fabric and the manycore is made via dedicated I/O interface blocks in the fabric. This communication configuration increases the need for a flexible architecture which can handle the placement of a single application bitstream in multiple locations on the logic fabric.
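To make the motivating constraint concrete, the sketch below contrasts a strict relocation rule (heterogeneous columns must match position by position, as on current devices) with the relaxed capacity-based rule that the enhanced architecture aims to enable. The column types and checking functions are illustrative assumptions, not the paper's mechanism.

```python
# Illustrative sketch (not the paper's mechanism): relocating a partial
# bitstream today requires the target region to hold heterogeneous blocks
# (CLB/BRAM/DSP) in the same relative columns. Counting resource footprints
# instead shows the flexibility the enhanced architecture aims to provide.

from collections import Counter

def fits_by_layout(design_cols, region_cols):
    """Strict relocation rule: column types must match position by position."""
    return design_cols == region_cols[:len(design_cols)]

def fits_by_capacity(design_cols, region_cols):
    """Relaxed rule: the region only needs at least as many columns of each
    block type, regardless of their relative placement."""
    need, have = Counter(design_cols), Counter(region_cols)
    return all(have[t] >= n for t, n in need.items())

design = ["CLB", "CLB", "DSP", "BRAM"]
region = ["CLB", "DSP", "CLB", "BRAM", "CLB"]
print(fits_by_layout(design, region))    # False: relative layouts differ
print(fits_by_capacity(design, region))  # True: enough of each block type
```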
DOI: 10.1109/FCCM.2014.17 (published 2014-05-11)
Citations: 2
Performance Comparison between Multi-FPGA Prototyping Platforms: Hardwired Off-the-Shelf, Cabling, and Custom
Qingshan Tang, M. Tuna, H. Mehrez
Multi-FPGA prototyping platforms can be classified into three categories: hardwired off-the-shelf, cabling, and custom. This paper develops three points. First, an automatic design flow is proposed to generate a cabling platform and a custom platform for a given design. Then, the optimal width of cables for a cabling multi-FPGA platform is explored. Finally, the performances of these three multi-FPGA platforms are compared. The results show that the cabling platform achieves up to 82% gain in performance, and the custom platform up to 100%, compared to the hardwired off-the-shelf platform. The custom platform achieves up to 20% gain in performance over the cabling platform. Therefore the results show that, apart from some stringent constraints (such as deployment cost or a specific frequency requirement), the relatively new cabling paradigm, combined with the proposed automatic inter-FPGA track distribution tool, offers an attractive alternative to the two other platforms.
DOI: 10.1109/FCCM.2014.44 (published 2014-05-11)
Citations: 8
Breaking Sequential Dependencies in FPGA-Based Sparse LU Factorization
Siddhartha, Nachiket Kapre
Substitution and reassociation of irregular sparse LU factorization can deliver up to 31% additional speedup over an existing state-of-the-art parallel FPGA implementation where further parallelization was deemed virtually impossible. The state-of-the-art implementation is already capable of delivering 3× acceleration over CPU-based sparse LU solvers. Sparse LU factorization is a well-known computational bottleneck in many existing scientific and engineering applications and is notoriously hard to parallelize due to inherent sequential dependencies in the computation graph. In this paper, we show how to break these alleged inherent dependencies using depth-limited substitution and reassociation of the resulting computation. This is a work-parallelism tradeoff that is well-suited for implementation on FPGA-based token dataflow architectures. Such compute organizations are capable of fast parallel processing of large irregular graphs extracted from the sparse LU computation. We manage and control the growth in additional work due to substitution through careful selection of substitution depth. We exploit associativity in the generated graphs to restructure long compute chains into reduction trees.
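The reduction-tree restructuring can be illustrated with a small sketch: a serial accumulation of n terms has a dependency depth of n-1, while pairwise reassociation brings it down to ceil(log2 n). The code below is a generic reconstruction of that idea, not the authors' tool.

```python
# Sketch of the reassociation idea (not the authors' tool): a long serial
# accumulation c = (((a0 + a1) + a2) + ...) becomes a balanced reduction
# tree, shrinking the dependency depth from n-1 to ceil(log2(n)).

def chain_depth(n):
    return n - 1                        # depth of the serial accumulation

def tree_depth(terms):
    """Reassociate pairwise until one value remains; count the levels."""
    depth = 0
    while len(terms) > 1:
        pairs = [terms[i:i + 2] for i in range(0, len(terms), 2)]
        terms = [p[0] if len(p) == 1 else ("+", *p) for p in pairs]
        depth += 1
    return depth

terms = [f"a{i}" for i in range(8)]
print(chain_depth(len(terms)), tree_depth(terms))  # 7 vs 3
```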
DOI: 10.1109/FCCM.2014.26 (published 2014-05-11)
Citations: 7
LEAP Shared Memories: Automating the Construction of FPGA Coherent Memories
Hsin-Jung Yang, Kermin Fleming, Michael Adler, J. Emer
Parallel programming has been widely used in many scientific and technical areas to solve large problems. While general-purpose processors have rich infrastructure to support parallel programming on shared memory, such as coherent caches and synchronization libraries, parallel programming infrastructure for FPGAs is limited. Thus, development of FPGA-based parallel algorithms remains difficult. In this work, we seek to simplify parallel programming on FPGAs. We provide a set of easy-to-use declarative primitives to maintain coherency and consistency of accesses to shared memory resources. We propose a shared-memory service that automatically manages coherent caches on multiple FPGAs. Experimental results of a 2-dimensional heat transfer equation show that the shared memory service with our distributed coherent caches outperforms a centralized cache by 2.6x. To handle synchronization, we provide new lock and barrier primitives that leverage native FPGA communication capabilities and outperform traditional through-memory primitives by 1.8x.
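The sketch below is a behavioral model of a sense-reversing barrier, the kind of synchronization primitive the abstract refers to; it illustrates the protocol in Python threads only, and is not the LEAP API, which is provided as a hardware library.

```python
# Behavioral model of a sense-reversing barrier, the flavor of primitive the
# LEAP service exposes in hardware; this Python version only illustrates the
# protocol and is not the LEAP API.

import threading

class SenseBarrier:
    def __init__(self, n):
        self.n = n
        self.count = n
        self.sense = False
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            my_sense = not self.sense       # the sense value after this phase
            self.count -= 1
            if self.count == 0:             # last arrival flips the sense
                self.count = self.n
                self.sense = my_sense
                self.cond.notify_all()
            else:
                self.cond.wait_for(lambda: self.sense == my_sense)

# Four workers synchronize at the barrier before continuing.
barrier = SenseBarrier(4)
def worker(i):
    barrier.wait()
    print(f"worker {i} passed the barrier")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```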
DOI: 10.1109/FCCM.2014.43 (published 2014-05-11)
Citations: 24
Accurate and Efficient Three Level Design Space Exploration Based on Constraints Satisfaction Optimization Problem Solver
Shuo Li, A. Hemani
In this paper, we propose an efficient and effective three-level Design Space Exploration (DSE) method for mapping a system consisting of a number of DSP functions onto an RTL or lower-level model using constraint programming methodology. The design space has three dimensions: a) function execution schedule (when the functions should execute), b) function implementation assignment (how the execution of functions is assigned to physical kernels), and c) implementation architecture (how many arithmetic units are deployed in each kernel). The DSE has been formulated as a Constraint Satisfaction Optimization Problem (CSOP) and solved by the constraint programming solver in Google's OR-Tools.
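A toy model in the spirit of this CSOP formulation can be written with Google OR-Tools. The sketch below uses the current CP-SAT solver (the paper predates CP-SAT and used the earlier constraint-programming solver), and the functions, kernels, and latencies are invented for illustration: it assigns each function to one kernel and minimizes the latency of the busiest kernel.

```python
# Toy assignment model in the spirit of the paper's CSOP formulation, written
# with the CP-SAT solver from Google OR-Tools. Functions, kernels and
# latencies are invented for illustration.
from ortools.sat.python import cp_model

latency = [[3, 5], [4, 2], [6, 3]]       # latency[f][k]: function f on kernel k
model = cp_model.CpModel()

assign = [[model.NewBoolVar(f"f{f}_k{k}") for k in range(2)] for f in range(3)]
for f in range(3):                        # each function runs on exactly one kernel
    model.Add(sum(assign[f]) == 1)

load = []                                 # total latency placed on each kernel
for k in range(2):
    lk = model.NewIntVar(0, 100, f"load_k{k}")
    model.Add(lk == sum(latency[f][k] * assign[f][k] for f in range(3)))
    load.append(lk)

makespan = model.NewIntVar(0, 100, "makespan")
model.AddMaxEquality(makespan, load)      # schedule length = busiest kernel
model.Minimize(makespan)

solver = cp_model.CpSolver()
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    for f in range(3):
        k = next(k for k in range(2) if solver.Value(assign[f][k]))
        print(f"function {f} -> kernel {k}")
    print("makespan:", solver.Value(makespan))
```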
DOI: 10.1109/FCCM.2014.56 (published 2014-05-11)
Citations: 0
Reducing Processing Latency with a Heterogeneous FPGA-Processor Framework
Jonathon Pendlum, M. Leeser, K. Chowdhury
Both Xilinx and Altera have released SoCs that tightly couple programmable logic with a dual core Cortex A9 ARM processor. These SoCs show promise in accelerating applications that exploit both the FPGA's parallel processing architecture and the CPU's sequential processing. For example, before accessing a wireless channel, a cognitive radio does spectrum sensing to detect channel occupancy and then makes a decision based on spectrum policies. Spectrum sensing maps well to FPGA fabric, while spectrum decision can be implemented with a CPU. Both algorithms are highly sensitive to latency as a faster decision improves spectrum utilization. This paper introduces CRASH: Cognitive Radio Accelerated with Software and Hardware - a new software and programmable logic framework for Xilinx's Zynq SoC targeting cognitive radio. We implement spectrum sensing and the spectrum decision in three configurations: both algorithms in the FPGA, both in software only, and spectrum sensing on the FPGA and spectrum decision on the CPU. We measure the end-to-end latency to detect and acquire unoccupied spectrum for these configurations. Results show that CRASH can successfully partition algorithms between FPGA and CPU and reduce processing latency.
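The split between the two stages can be sketched as follows: energy-detection sensing (an FFT plus a power threshold), which maps naturally to FPGA fabric, and a policy-driven channel decision, which suits the CPU. The sample data, threshold, and policy in this sketch are assumptions for illustration, not CRASH itself.

```python
# Illustrative split of the two stages (not CRASH itself): energy-detection
# sensing, which maps naturally to FPGA fabric, and a policy-driven decision,
# which suits the CPU. Sample rate, threshold and policy are made up.
import numpy as np

def sense_spectrum(samples, fft_size=1024, threshold_db=-70.0):
    """Energy detection: report which FFT bins exceed the power threshold."""
    spectrum = np.fft.fft(samples[:fft_size])
    power_db = 10 * np.log10(np.abs(spectrum) ** 2 / fft_size + 1e-12)
    return power_db > threshold_db        # boolean occupancy per bin

def decide_channel(occupancy, allowed_bins):
    """Spectrum decision: pick the first policy-allowed bin that is idle."""
    for b in allowed_bins:
        if not occupancy[b]:
            return b
    return None                           # no usable channel right now

rng = np.random.default_rng(0)
samples = rng.normal(scale=1e-4, size=1024)       # near-empty band
occupancy = sense_spectrum(samples)
print("transmit on bin:", decide_channel(occupancy, allowed_bins=range(100, 200)))
```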
DOI: 10.1109/FCCM.2014.13 (published 2014-05-11)
Citations: 8
SMCGen: Generating Reconfigurable Design for Sequential Monte Carlo Applications
T. Chau, Maciej Kurek, James Stanley Targett, J. Humphrey, Georgios Skouroupathis, A. Eele, J. Maciejowski, Benjamin Cope, Kathryn Cobden, P. Leong, P. Cheung, W. Luk
The Sequential Monte Carlo (SMC) method is a simulation-based approach to compute posterior distributions. SMC methods often work well on applications considered intractable by other methods due to high dimensionality, but they are computationally demanding. While SMC has been implemented efficiently on FPGAs, design productivity remains a challenge. This paper introduces a design flow for generating efficient implementation of reconfigurable SMC designs. Through templating the SMC structure, the design flow enables efficient mapping of SMC applications to multiple FPGAs. The proposed design flow consists of a parametrisable SMC computation engine, and an open-source software template which enables efficient mapping of a variety of SMC designs to reconfigurable hardware. Design parameters that are critical to the performance and to the solution quality are tuned using a machine learning algorithm based on surrogate modelling. Experimental results for three case studies show that design performance is substantially improved after parameter optimisation. The proposed design flow demonstrates its capability of producing reconfigurable implementations for a range of SMC applications that have significant improvement in speed and in energy efficiency over optimised CPU and GPU implementations.
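For readers unfamiliar with SMC, the sketch below shows the sample-weight-resample loop that such a framework templates, using a minimal bootstrap particle filter for a 1-D random-walk model; the model and all parameters are toy assumptions, not a design generated by SMCGen.

```python
# Minimal bootstrap particle filter showing the sample-weight-resample loop
# that SMCGen templates for hardware; the 1-D random-walk model and all
# parameters here are toy assumptions, not a design from the paper.
import numpy as np

def particle_filter(observations, n_particles=1000, proc_std=1.0, obs_std=0.5):
    rng = np.random.default_rng(1)
    particles = rng.normal(0.0, 1.0, n_particles)
    estimates = []
    for y in observations:
        # Propagate through the (assumed) random-walk state model.
        particles = particles + rng.normal(0.0, proc_std, n_particles)
        # Weight by the Gaussian likelihood of the observation.
        weights = np.exp(-0.5 * ((y - particles) / obs_std) ** 2)
        weights /= weights.sum()
        # Resample to concentrate particles where the posterior mass is.
        idx = rng.choice(n_particles, n_particles, p=weights)
        particles = particles[idx]
        estimates.append(particles.mean())
    return estimates

true_path = np.cumsum(np.random.default_rng(2).normal(size=20))
obs = true_path + np.random.default_rng(3).normal(scale=0.5, size=20)
print(particle_filter(obs)[:5])           # posterior mean for the first steps
```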
DOI: 10.1109/FCCM.2014.46 (published 2014-05-11)
Citations: 7