
2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig): latest publications

Breeze computing: A just in time (JIT) approach for virtualizing FPGAs in the cloud
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857159
Sen Ma, D. Andrews, Shanyuan Gao, Jaime Cummins
In this paper, we introduce a new design flow and architecture that lets programmers replace synthesis with compilation to create custom accelerators within data center and warehouse-scale computers that include reconfigurable many-core architectures. In our approach, we virtualize FPGAs into pre-defined partially reconfigurable tiles. We then define a run-time interpreter that assembles bitstream versions of programming patterns into the tiles. The bitstreams, as well as software executables, are maintained within libraries accessed by the application programmers. Synthesis occurs hand in hand with the initial coding of the software programming patterns, when a Domain Specific Language is first created for the application programmers. Initial results show that the approach allows hardware accelerators to be compiled 100x faster than synthesizing the same functionality. Initial performance results further show that a compilation/interpretation approach can achieve approximately equivalent performance for matrix operations and filtering compared to synthesizing a custom accelerator.
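The compile-instead-of-synthesize flow can be pictured as a run-time interpreter that looks up precompiled partial bitstreams for known programming patterns and binds them to free tiles, so no synthesis runs at "compile" time. A minimal sketch of that idea (all names and the library contents are hypothetical; the paper's actual interpreter is not described at this level of detail):

```python
# Toy model of a JIT bitstream interpreter: programming patterns are
# precompiled to partial bitstreams once, and "compilation" at run time
# is just library lookup plus tile allocation -- no synthesis in the loop.
class TileInterpreter:
    def __init__(self, num_tiles, bitstream_library):
        self.free_tiles = list(range(num_tiles))
        self.library = bitstream_library  # pattern name -> bitstream blob

    def compile_accelerator(self, patterns):
        """Bind each requested pattern to a free reconfigurable tile."""
        placement = {}
        for pattern in patterns:
            if pattern not in self.library:
                raise KeyError(f"no precompiled bitstream for {pattern!r}")
            if not self.free_tiles:
                raise RuntimeError("out of free tiles")
            tile = self.free_tiles.pop(0)
            placement[tile] = self.library[pattern]  # partial reconfiguration would happen here
        return placement

library = {"map": b"\x01bits", "reduce": b"\x02bits"}  # hypothetical blobs
interp = TileInterpreter(num_tiles=4, bitstream_library=library)
placement = interp.compile_accelerator(["map", "reduce"])
print(sorted(placement))  # -> [0, 1]
```

Because the expensive step (synthesis) is moved entirely offline, the run-time cost is a dictionary lookup and a reconfiguration, which is consistent with the reported 100x compile-time gain.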
Citations: 6
Hobbit — Smaller but faster than a dwarf: Revisiting lightweight SHA-3 FPGA implementations
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857176
Bernhard Jungk, Marc Stöttinger
In this paper, we revisit lightweight FPGA implementations of SHA-3 and improve upon the state of the art by applying a new optimization technique, based on a shallow pipeline, to the slice-oriented architecture. As a result, the implementation area is reduced by almost one quarter (23%) compared with the hitherto smallest implementation for Virtex-5 FPGAs. The proposed design also improves the throughput-area ratio by 59%. For Virtex-6 FPGAs, the improvements are even larger, with the throughput-area ratio increasing by over 150% relative to previously reported results for this FPGA. Furthermore, we evaluate several additional implementation trade-offs. First, we determine the maximum number of pipeline stages for lightweight architectures that process several slices in parallel, and for variants of SHA-3 with only 800 and 400 bits of internal state. Second, we evaluate several hardware interfaces. This evaluation shows that the hardware interface may have a significant impact on area consumption and throughput.
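The headline numbers combine into a single figure of merit, throughput divided by area (e.g. Mbps per slice). As a sanity check on how a 23% area reduction composes with a throughput change into a 59% ratio gain, the arithmetic can be spelled out (the throughput figures below are illustrative back-of-envelope values, not taken from the paper):

```python
def ratio_gain(old_tp, old_area, new_tp, new_area):
    """Relative improvement in the throughput-area ratio."""
    return (new_tp / new_area) / (old_tp / old_area) - 1.0

# Illustrative figures: a 23% smaller design whose throughput also rose ~22%
old_tp, old_area = 100.0, 100.0   # Mbps, slices (hypothetical baseline)
new_tp, new_area = 122.4, 77.0    # 23% fewer slices
gain = ratio_gain(old_tp, old_area, new_tp, new_area)
print(f"{gain:.0%}")  # -> 59%
```

The point of the check: an area reduction alone (same throughput, 77% area) would already give a 1/0.77 - 1 ≈ 30% ratio gain, so the reported 59% implies the throughput improved as well.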
Citations: 12
Power-efficiency analysis of accelerated BWA-MEM implementations on heterogeneous computing platforms
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857181
Ernst Houtgast, V. Sima, G. Marchiori, K. Bertels, Z. Al-Ars
Next Generation Sequencing techniques have dramatically reduced the cost of sequencing genetic material, resulting in huge amounts of data being sequenced. Processing this data poses major challenges, both from a performance perspective and from a power-efficiency perspective. Heterogeneous computing can help on both fronts by enabling higher-performance and more power-efficient solutions. In this paper, the power-efficiency of the BWA-MEM algorithm, a popular tool for genomic data mapping, is studied on two heterogeneous architectures. The performance and power-efficiency of an FPGA-based implementation, using a single Xilinx Virtex-7 FPGA on the Alpha Data add-in card, are compared to a GPU-based implementation using an NVIDIA GeForce GTX 970 and to a software-only baseline system. By offloading the Seed Extension phase onto an accelerator, both implementations achieve a two-fold speedup in overall application-level performance over the software-only implementation. Moreover, the highly customizable nature of the FPGA results in much higher power-efficiency, as the FPGA's power consumption is less than one fourth of the GPU's. To facilitate platform- and tool-agnostic comparisons, the base pairs per Joule unit is introduced as a measure of power-efficiency. The FPGA design is able to map up to 44 thousand base pairs per Joule, a 2.1x gain in power-efficiency compared to the software-only baseline.
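The proposed metric is straightforward to compute: base pairs mapped divided by energy consumed (average power times runtime). A small sketch with hypothetical numbers chosen to land near the reported ~44 thousand base pairs per Joule; the actual workload sizes and power draw in the paper may differ:

```python
def base_pairs_per_joule(base_pairs, avg_power_watts, runtime_seconds):
    """Platform-agnostic power-efficiency metric: work done per unit energy."""
    energy_joules = avg_power_watts * runtime_seconds
    return base_pairs / energy_joules

# Hypothetical run: 10 Gbp mapped in 3600 s at ~63 W average board power
bp_per_j = base_pairs_per_joule(10e9, 63.0, 3600.0)
print(round(bp_per_j))  # roughly 44 000 bp/J
```

Because the unit divides out runtime, it lets a fast-but-hungry GPU and a slower-but-frugal FPGA be compared on equal footing, which is exactly the tool-agnostic comparison the paper is after.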
Citations: 13
ARM+FPGA platform to manage solid-state-smart transformer in smart grid application
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857155
N. Nila-Olmedo, F. Mendoza-Mondragón, A. Espinosa-Calderón, Moreno
This paper proposes a digital platform based on the Xilinx Zynq®-7000 family to process the functions performed by a smart-solid-state-transformer (3ST) in smart grid (SG) applications. These functions include: linking to information and communication technologies (ICT), voltage transformation, and integration of distributed renewable energy resources (DRER) and distributed energy storage devices (DESD). The Zynq platform embeds a dual-core ARM® Cortex™-A9 processor and Field Programmable Gate Array (FPGA) technology, all within a programmable system on a chip (SoC). The main advantages of this technology are modularity, scalability, quick and easy maintenance, and low cost. Experimental results are included to show some capabilities of the proposed platform in a 3ST laboratory test-bed.
Citations: 4
Towards FPGA-assisted spark: An SVM training acceleration case study
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857194
S. M. H. Ho, Maolin Wang, Ho-Cheung Ng, Hayden Kwok-Hay So
A system that augments the Apache Spark data processing framework with FPGA accelerators is presented as a way to model and deploy FPGA-assisted applications in large-scale clusters. In our proposed framework, FPGAs can optionally be used in place of the host CPU for Resilient Distributed Dataset (RDD) transformations, allowing seamless integration between gateware and software processing. Using the training of a Support Vector Machine (SVM) cell image classifier as a case study, we explore the feasibility, benefits, and challenges of this technique. In our experiments, where data communication between CPU and FPGA is tightly controlled, a consistent speedup of up to 1.6x is achieved for the target SVM training application as the cluster size increases. Hardware-software techniques crucial to achieving acceleration, such as managing data partition size, are explored. We demonstrate the benefits of the proposed framework, while also illustrating the importance of careful hardware-software management to avoid excessive CPU-FPGA communication, which can quickly diminish the benefits of FPGA acceleration.
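The framework's central idea — an RDD transformation that may execute on either the host CPU or the FPGA — can be modelled as a per-partition dispatch that only offloads when the partition is large enough to amortize the CPU-FPGA transfer cost, which is also why partition-size management matters. A toy sketch; `fpga_map`, the threshold, and both worker functions are stand-ins, as the real system's API is not given in the abstract:

```python
# Toy model of an FPGA-optional transformation: the partition is sent to
# the accelerator only when it is large enough to amortize transfer cost.
def fpga_map(partition, cpu_fn, fpga_fn, min_offload_size=1024):
    if len(partition) >= min_offload_size:
        return fpga_fn(partition)   # would trigger DMA + accelerator invocation
    return [cpu_fn(x) for x in partition]

cpu_square = lambda x: x * x
fpga_square = lambda part: [x * x for x in part]  # same semantics, "on FPGA"

small = fpga_map(list(range(10)), cpu_square, fpga_square)    # stays on CPU
large = fpga_map(list(range(2048)), cpu_square, fpga_square)  # offloaded
print(small[:3])  # -> [0, 1, 4]
```

Setting `min_offload_size` too low would re-introduce exactly the excessive CPU-FPGA communication the paper warns about.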
Citations: 3
Robust bitstream protection in FPGA-based systems through low-overhead obfuscation
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857187
Robert Karam, Tamzidul Hoque, S. Ray, M. Tehranipoor, S. Bhunia
Reconfigurable hardware, such as Field Programmable Gate Arrays (FPGAs), is being increasingly deployed in diverse application areas, including automotive systems, critical infrastructures, and the emerging Internet of Things (IoT), to implement customized designs. However, securing FPGA-based designs against piracy, reverse engineering, and tampering is challenging, especially for systems that require remote upgrades. In many cases, existing solutions based on bitstream encryption may not provide sufficient protection against these attacks. In this paper, we present a novel obfuscation approach for provably robust protection of FPGA bitstreams at low overhead that goes well beyond the protection offered by bitstream encryption. The approach works with existing FPGA architectures and synthesis flows, and can be used together with encryption techniques, or by itself for power- and area-constrained systems. It leverages “FPGA dark silicon” — unused resources within the configurable logic blocks — to efficiently obfuscate the true functionality. We provide a detailed threat model and security analysis for the approach. We have developed a complete application mapping framework that integrates with the Altera Quartus II software. Using this CAD framework, we achieve provably strong security against all major attacks on FPGA bitstreams, with an average 13% latency and 2% total power overhead for a set of benchmark circuits, as well as several large-scale open-source IP blocks on a commercial FPGA.
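As a generic illustration of the key idea behind logic obfuscation (this is a textbook key-based locking toy, not the paper's specific scheme): a LUT's truth table can be stored XOR-masked with a secret key, so a pirated bitstream without the key computes the wrong function.

```python
# Toy key-based logic locking: the stored truth table is XOR-masked, and
# the design only computes the intended function with the correct key.
import secrets

def obfuscate_lut(truth_table, key):
    """Mask each truth-table bit with the corresponding key bit."""
    return [bit ^ k for bit, k in zip(truth_table, key)]

def eval_lut(masked_table, key, row):
    """Unmask on the fly: correct output only if the key bit is right."""
    return masked_table[row] ^ key[row]

and2 = [0, 0, 0, 1]                        # 2-input AND truth table
key = [secrets.randbelow(2) for _ in and2] # secret per-row key bits
masked = obfuscate_lut(and2, key)

# With the correct key, every input row recovers the intended function
assert all(eval_lut(masked, key, i) == and2[i] for i in range(4))
```

A fully wrong key (every bit flipped) would invert every output row, so the attacker recovers a different Boolean function entirely; the paper's actual construction additionally hides the obfuscation inside otherwise-unused CLB resources.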
Citations: 33
FPGA implementation of optimized XBM specifications by transformation for AFSMs
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857171
Kledermon Garcia, D. L. Oliveira, R. d'Amore, L. Faria, J. L. V. Oliveira
The asynchronous paradigm is an alternative for digital system design because it eliminates problems related to the clock signal, such as clock skew, clock distribution, and the power dissipated by the clock. One interesting asynchronous design style, familiar to designers, divides the system into an asynchronous controller with a synchronous datapath. A specification known as Extended Burst-Mode (XBM) is the most adequate one for describing the asynchronous controllers in this design style. An XBM specification must satisfy a number of properties to be implementable. One property, known as signal polarity, may affect controller performance. To satisfy the signal polarity, the designer must often introduce state transitions that do not perform any operation, which this paper calls “dead transitions”. An XBM specification with dead transitions can reduce controller performance. In this paper, we propose an algorithm that eliminates dead transitions from an XBM specification. The elimination is performed by transforming the original XBM specification, which leads to an optimization of system performance. The algorithm was applied to seven well-known benchmarks and obtained a reduction of up to 37% in processing time.
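The core transformation can be pictured on a simplified transition list: a dead transition produces no outputs, so it can be bypassed by splicing its input burst into the successor transition. This is a deliberately simplified toy (real XBM semantics include directed don't-cares and non-trivial burst composition rules that the paper's algorithm must respect):

```python
# Transitions: (src, dst, input_burst, output_burst). A "dead" transition
# has an empty output burst; bypass it by merging its input burst into the
# successor transition that leaves its destination state.
def eliminate_dead(transitions):
    dead = [t for t in transitions if not t[3]]
    alive = [t for t in transitions if t[3]]
    for src, dst, in_burst, _ in dead:
        alive = [
            (src, d, in_burst + ib, ob) if s == dst else (s, d, ib, ob)
            for s, d, ib, ob in alive
        ]
    return alive

spec = [
    ("s0", "s1", ["a+"], []),          # dead: changes state, does nothing
    ("s1", "s2", ["b+"], ["x+"]),      # live: raises output x
]
print(eliminate_dead(spec))  # -> [('s0', 's2', ['a+', 'b+'], ['x+'])]
```

After the merge, the controller waits for both `a+` and `b+` in one burst before raising `x+`, removing the intermediate state and its extra state-transition latency, which is the source of the reported speedup.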
Citations: 3
A multi-functional memory unit with PLA-based reconfigurable decoder
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857145
Nobuyuki Yahiro, Bo Liu, Atsushi Nanri, S. Nakatake, Y. Takashima, Gong Chen
Application-specific usage of memory is an important key in the development of embedded systems for IoT devices. A functional memory unit, such as content addressable memory (CAM), is a good solution for network-specific applications. This work proposes a novel functional memory unit that can reconfigure the function of the memory decoder. In our reconfigurable mechanism, uni-switch cells are introduced to play the alternative role of a logic gate or a wire, and are embedded in an SRAM memory array. A set of uni-switches is connected to constitute a programmable logic array (PLA) unit. Compared with a look-up table (LUT), the PLA has the advantage, well suited to a decoder, that multi-input, multi-output functions can be realized in a small area. Hence, an extended decoder function is realized by PLA units inside the memory array, and a combination of PLA units makes it possible to configure various functions over stored data, such as sorting, filtering, error correction, and encryption/decryption. In this paper, we present the fundamental architecture of our functional memory unit with PLA units, and demonstrate an implementation of a 32-bit full adder and a 2-bit counter using PLA units.
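A PLA computes sums of products: an AND plane forms product terms from the inputs, and an OR plane combines selected terms into each output — which is why a multi-input, multi-output decoder maps onto it naturally. A minimal software model of that structure (the planes below encode an ordinary 2-to-4 one-hot decoder as an illustration, not the paper's extended decoder):

```python
# Software model of a PLA: AND plane -> product terms, OR plane -> outputs.
# Each AND-plane row records which inputs are connected and with what value;
# each OR-plane row records which product terms feed that output.
def pla_eval(inputs, and_plane, or_plane):
    terms = [all(inputs[i] == v for i, v in term.items()) for term in and_plane]
    return [int(any(terms[t] for t in out)) for out in or_plane]

# 2-to-4 one-hot decoder: four product terms, one per input combination
and_plane = [{0: 0, 1: 0}, {0: 0, 1: 1}, {0: 1, 1: 0}, {0: 1, 1: 1}]
or_plane = [{0}, {1}, {2}, {3}]
print(pla_eval([1, 0], and_plane, or_plane))  # -> [0, 0, 1, 0]
```

A LUT-based decoder would need one 2^n-entry table per output; the PLA shares its product terms across all outputs, which is the area advantage the abstract points to.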
Citations: 0
The R2-D2 toolchain — Automated porting of safety-critical applications to FPGAs
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857192
Steffen Vaas, M. Reichenbach, Ulrich Margull, D. Fey
Safety-critical applications require reliable hardware platforms with deterministic behavior. Given the increasing demand for performance, current single-core solutions are no longer sufficient. Classical multi-core processors are designed for the general application case, providing high performance at the expense of determinism and reliability. In safety-critical applications, all required tasks are already known at development time. They are specified by a system description, such as AUTOSAR. Thus, a hardware architecture providing one core for each task and one physical link for each data exchange between different tasks can be derived. However, such a highly application-specific architecture is not available off the shelf. The latest FPGA technologies now provide enough resources to integrate several soft-core processors in one low-cost chip. Furthermore, the cores and their connections can be arranged flexibly in an FPGA. To bridge the gap between safety-critical applications and FPGAs, this approach provides a toolchain, as an addition to existing AUTOSAR design tools, for automatically generating a specific hardware architecture from the metadata of an AUTOSAR description. By drastically reducing the complexity of the hardware platform, a reconfigurable, reliable, deterministic, distributed (R2-D2) hardware architecture can be created. The results show that safety-critical tasks can be executed deterministically and in parallel on one chip, and that multiple applications can be mapped to one low-cost FPGA. Furthermore, system latency can be reduced substantially, opening up new application areas.
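Deriving the architecture from the system description is mechanical: one soft core per task, one dedicated point-to-point link per declared data exchange. A sketch of that derivation on a hypothetical, AUTOSAR-like task list (the real toolchain parses actual AUTOSAR metadata; the task names and naming scheme here are invented):

```python
# One core per task, one dedicated link per declared data exchange:
# the hardware architecture falls directly out of the system model.
def derive_architecture(tasks, exchanges):
    cores = {task: f"core{idx}" for idx, task in enumerate(tasks)}
    links = [(cores[src], cores[dst]) for src, dst in exchanges]
    return cores, links

tasks = ["sensor_read", "control_law", "actuator_write"]
exchanges = [("sensor_read", "control_law"),
             ("control_law", "actuator_write")]
cores, links = derive_architecture(tasks, exchanges)
print(links)  # -> [('core0', 'core1'), ('core1', 'core2')]
```

Because every task owns a core and every exchange owns a physical link, there is no scheduling or bus arbitration to reason about, which is where the deterministic timing comes from.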
Citations: 3
Automating structured matrix-matrix multiplication for stream processing 自动化结构化矩阵-矩阵乘法流处理
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857158
Thaddeus Koehn, P. Athanas
Structured matrices, in which at least one element is known to always be zero, commonly appear in a variety of applications, including Markov processes, MIMO communications, and eigenvalue decomposition. Since matrices with known zeros require fewer computations, generating hardware that exploits this increases throughput. The approach in this paper can generate hardware for anything from very sparse to completely dense matrices. When dense (all elements non-zero) matrix multiplication hardware is generated, throughput is comparable to commercially available generators; as sparsity increases, throughput improves proportionally. The method also achieves a shorter processing delay than other techniques for sparse matrices.
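The saving the abstract claims comes from skipping multiply-accumulates at positions known to be zero at build time. The hedged sketch below models that idea in software — the paper's hardware generator is not public, so the zero pattern (upper-triangular) and the multiply counter are illustrative assumptions only.

```python
def structured_matmul(A, B, pattern_a):
    """Multiply A @ B, visiting only positions where pattern_a[i][k] is True
    (i.e., where A[i][k] can be non-zero). Returns the result and the number
    of scalar multiplies performed -- a stand-in for multiplier count in hardware."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    mults = 0
    for i in range(n):
        for k in range(m):
            if not pattern_a[i][k]:
                continue  # known zero: no multiplier would be instantiated
            for j in range(p):
                C[i][j] += A[i][k] * B[k][j]
                mults += 1
    return C, mults

# Upper-triangular 3x3 pattern: only 6 of 9 positions can be non-zero.
pattern = [[j >= i for j in range(3)] for i in range(3)]
A = [[1, 2, 3], [0, 4, 5], [0, 0, 6]]
B = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # identity, so C should equal A
C, mults = structured_matmul(A, B, pattern)
print(C)      # A * I == A
print(mults)  # 18 multiplies instead of the dense 27
```

The proportionality the abstract reports falls out directly: with 6 of 9 positions live, the structured schedule does 6/9 of the dense work.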
Citations: 2
Journal
2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)