This paper presents a novel cell architecture for evolvable systolic arrays. HexCell is a tileable processing element with a hexagonal shape that can be implemented and dynamically reconfigured on field-programmable gate arrays (FPGAs). The cell contains a functional unit, three input ports, and three output ports. It supports two concurrent configuration schemes: dynamic partial reconfiguration (DPR), where the functional unit is partially reconfigured at run time, and a virtual reconfiguration circuit (VRC), where each cell output port either bypasses one of the input data streams or selects the functional-unit output. Hence, HexCell combines the merits of DPR and VRC, including resource awareness, reconfiguration speed, and routing flexibility. In addition, the cell structure supports pipelining and data synchronization for achieving high throughput in data-intensive applications such as image processing. A HexCell is represented by a binary string (chromosome) that encodes the cell's function and its output selections. Our evolvable HexCell array supports more inputs and outputs, a wider variety of possible datapaths, and faster reconfiguration than the state-of-the-art systolic array, while maintaining the same resource utilization. Moreover, when the same genetic algorithm is run on the two systolic arrays, results show that the HexCell array achieves higher throughput and evolves faster than the state-of-the-art array.
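To make the chromosome idea concrete, the following Python sketch decodes a packed HexCell chromosome into a function and three output-port selections. The field widths and function table are illustrative assumptions; the paper does not fix this exact encoding.

```python
# Minimal sketch of decoding a HexCell chromosome, assuming a 4-bit function
# field followed by one 2-bit selector per output port (field widths are
# illustrative; the paper does not specify the exact encoding).
FUNCTIONS = {0b0000: "pass", 0b0001: "add", 0b0010: "sub", 0b0011: "mul"}

def decode_hexcell(chromosome: int, num_outputs: int = 3) -> dict:
    """Split a packed integer chromosome into function and per-output selectors."""
    func_bits = chromosome & 0b1111
    config = {"function": FUNCTIONS.get(func_bits, "reserved"), "outputs": []}
    selectors = chromosome >> 4
    for port in range(num_outputs):
        sel = (selectors >> (2 * port)) & 0b11
        # sel 0..2: bypass input port 0..2, sel 3: drive functional-unit output
        config["outputs"].append("FU" if sel == 0b11 else f"in{sel}")
    return config

print(decode_hexcell(0b11_01_00_0001))  # add; out0<-in0, out1<-in1, out2<-FU
```

A genetic algorithm would mutate and recombine such bit strings, with the DPR function field selecting a partial bitstream and the VRC selector bits driving the output multiplexers.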
{"title":"HexCell: a Hexagonal Cell for Evolvable Systolic Arrays on FPGAs: (Abstract Only)","authors":"F. Hussein, Luka Daoud, N. Rafla","doi":"10.1145/3174243.3174988","DOIUrl":"https://doi.org/10.1145/3174243.3174988","url":null,"abstract":"This paper presents a novel cell architecture for evolvable systolic arrays. HexCell is a tile-able processing element with a hexagonal shape that can be implemented and dynamically reconfigured on field-programmable gate arrays (FPGAs). The cell contains a functional unit, three input ports, and three output ports. It supports two concurrent configuration schemes: dynamic partial reconfiguration (DPR), where the functional unit is partially reconfigured at run time, and virtual reconfiguration circuit (VRC), where the cell output port bypasses one of the input data or selects the functional unit output. Hence, HexCell combines the merits of DPR and VRC including resource-awareness, reconfiguration speed and routing flexibility. In addition, the cell structure supports pipelining and data synchronization for achieving high throughput for data-intensive applications like image processing. A HexCell is represented by a binary string (chromosome) that encodes the cell's function and the output selections. Our developed evolvable HexCell array supports more inputs and outputs, a variety of possible datapaths, and has faster reconfiguration, compared to the state-of-the-art systolic array while maintaining the same resource utilization. Moreover, by using the same genetic algorithm on the two systolic arrays, results show that the HexCell array has higher throughput and can evolve faster than state-of-the-art array.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131024262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Routing of nets is one of the most time-consuming steps in the FPGA design flow. Existing works have described ways of accelerating the process through parallelization. However, only some of them are deterministic, and determinism is often achieved at the cost of speedup. In this paper, we propose ParaDRo, a parallel FPGA router based on spatial partitioning that achieves deterministic results while maintaining reasonable speedup. Existing spatial-partitioning-based routers do not scale well because the number of nets available to fully utilize all processors shrinks as the number of processors increases. In addition, they route nets within a spatial partition sequentially. ParaDRo mitigates this problem by scheduling nets within a spatial partition to be routed in parallel if their bounding boxes do not overlap. Further parallelism is extracted by decomposing multi-sink nets into single-sink nets to minimize bounding-box overlaps and increase the number of nets that can be routed in parallel. These improvements enable ParaDRo to achieve an average speedup of 5.4X with 8 threads, with minimal impact on the quality of results.
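The bounding-box scheduling idea can be illustrated with a short sketch. The following Python snippet (an illustration, not ParaDRo's actual implementation) greedily groups nets whose bounding boxes are mutually disjoint into batches that could be routed concurrently.

```python
# Illustrative sketch: batch nets whose bounding boxes do not overlap so that
# each batch can be routed in parallel without sharing routing resources.
def overlaps(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return not (ax1 < bx0 or bx1 < ax0 or ay1 < by0 or by1 < ay0)

def schedule_nets(nets):
    """nets: list of (name, bounding_box); returns batches of mutually disjoint nets."""
    batches = []
    for name, box in nets:
        for batch in batches:
            if all(not overlaps(box, other_box) for _, other_box in batch):
                batch.append((name, box))
                break
        else:
            batches.append([(name, box)])
    return batches

nets = [("n1", (0, 0, 3, 3)), ("n2", (4, 4, 6, 6)), ("n3", (2, 2, 5, 5))]
print(schedule_nets(nets))  # [[n1, n2], [n3]] -> two parallel routing rounds
```

Decomposing multi-sink nets into single-sink nets shrinks the individual bounding boxes, which increases the chance that two nets end up in the same batch.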
{"title":"ParaDRo: A Parallel Deterministic Router Based on Spatial Partitioning and Scheduling","authors":"Chin Hau Hoo, Akash Kumar","doi":"10.1145/3174243.3174246","DOIUrl":"https://doi.org/10.1145/3174243.3174246","url":null,"abstract":"Routing of nets is one of the most time-consuming steps in the FPGA design flow. Existing works have described ways of accelerating the process through parallelization. However, only some of them are deterministic, and determinism is often achieved at the cost of speedup. In this paper, we propose ParaDRo, a parallel FPGA router based on spatial partitioning that achieves deterministic results while maintaining reasonable speedup. Existing spatial partitioning based routers do not scale well because the number of nets that can fully utilize all processors reduces as the number of processors increases. In addition, they route nets that are within a spatial partition sequentially. ParaDRo mitigates this problem by scheduling nets within a spatial partition to be routed in parallel if they do not have overlapping bounding boxes. Further parallelism is extracted by decomposing multi-sink nets into single-sink nets to minimize the amount of bounding box overlaps and increase the number of nets that can be routed in parallel. These improvements enable ParaDRo to achieve an average speedup of 5.4X with 8 threads with minimal impact on the quality of results.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124995582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-Level Synthesis (HLS) promises improved designer productivity, but requires a debug ecosystem that allows designers to debug in the context of the original source code. Recent work has presented in-system debug frameworks in which instrumentation added to the design collects trace data as the circuit runs and a software tool allows the user to replay the execution using the captured data. When searching for the root cause of a bug, the designer may need to modify the instrumentation to collect data from a new part of the design, requiring a lengthy recompile. In this paper, we propose a flexible debug overlay family that provides software-like debug turnaround times for HLS-generated circuits. At compile time, the overlay is added to the design and compiled. At debug time, the overlay can be configured many times to implement specific debug scenarios without recompilation. This paper first outlines a number of "capabilities" that such an overlay should have, and then describes architectural support for each of these capabilities. The cheapest overlay variant allows selective variable tracing with only a 1.7% increase in area overhead over the baseline debug instrumentation, while the deluxe variant offers a 2x-7x improvement in trace buffer memory utilization with conditional buffer freeze support.
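As an illustration of two of these capabilities, the following Python model sketches selective variable tracing and conditional buffer freeze as a behavioral simulation; the class, its parameters, and the trace format are assumptions, not the paper's hardware interface.

```python
# Behavioral sketch (assumptions, not the paper's RTL): a circular trace buffer
# that records only the variables selected at "debug time" and stops capturing
# once a configurable freeze condition fires.
from collections import deque

class TraceOverlay:
    def __init__(self, depth, traced_vars, freeze_cond=None):
        self.buffer = deque(maxlen=depth)      # circular trace memory
        self.traced_vars = set(traced_vars)    # selective variable tracing
        self.freeze_cond = freeze_cond         # conditional buffer freeze
        self.frozen = False

    def cycle(self, cycle_no, values):
        if self.frozen:
            return
        sample = {v: values[v] for v in self.traced_vars if v in values}
        self.buffer.append((cycle_no, sample))
        if self.freeze_cond and self.freeze_cond(values):
            self.frozen = True                 # keep the window around the bug

overlay = TraceOverlay(depth=4, traced_vars={"i", "acc"},
                       freeze_cond=lambda v: v.get("acc", 0) > 10)
for c in range(8):
    overlay.cycle(c, {"i": c, "acc": c * c, "tmp": -c})
print(list(overlay.buffer))   # last samples up to and including the trigger cycle
```

In the actual overlay, the variable selection and freeze condition would be written into configuration registers at debug time rather than recompiled into the bitstream.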
{"title":"Architecture Exploration for HLS-Oriented FPGA Debug Overlays","authors":"Al-Shahna Jamal, Jeffrey B. Goeders, S. Wilton","doi":"10.1145/3174243.3174254","DOIUrl":"https://doi.org/10.1145/3174243.3174254","url":null,"abstract":"High-Level Synthesis (HLS) promises improved designer productivity, but requires a debug ecosystem that allows designers to debug in the context of the original source code. Recent work has presented in-system debug frameworks where instrumentation added to the design collects trace data as the circuit runs, and a software tool that allows the user to replay the execution using the captured data. When searching for the root cause of a bug, the designer may need to modify the instrumentation to collect data from a new part of the design, requiring a lengthy recompile. In this paper, we propose a flexible debug overlay family that provides software-like debug turn-around times for HLS generated circuits. At compile time, the overlay is added to the design and compiled. At debug time, the overlay can be configured many times to implement specific debug scenarios without a recompilation. This paper first outlines a number of \"capabilities\" that such an overlay should have, and then describes architectural support for each of these capabilities. The cheapest overlay variant allows selective variable tracing with only a 1.7% increase in area overhead from the baseline debug instrumentation, while the deluxe variant offers 2x-7x improvement in trace buffer memory utilization with conditional buffer freeze support.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126290004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nonvolatile FPGAs (NV-FPGAs) have the potential to eliminate the standby power that is increasingly wasted in recent standard SRAM-based FPGAs. However, the functionality of conventional NV-FPGAs is not sufficient compared to that of standard SRAM-based FPGAs. For example, an effective circuit structure for performing the shift-register (SR) function has not yet been proposed. In this paper, a magnetic tunnel junction (MTJ) based nonvolatile lookup table (NV-LUT) circuit that can perform the SR function with low power consumption is proposed. The MTJ device is the best candidate in terms of virtually unlimited endurance, CMOS compatibility, and 3D stacking capability. On the other hand, the large power consumption of the SR function is a serious design issue for the MTJ-based NV-LUT circuit: because the write current of the MTJ device is large and a CMOS-oriented implementation must update all the stored data after each SR operation, high power consumption is unavoidable. To overcome this issue, in the proposed LUT circuit the address for read/write access is incremented at each cycle instead of directly shifting the data. In this way, the number of data updates per 1-bit shift is reduced to one, which results in a large power saving. Moreover, since the selector is shared between the read (logic) and write operations, its hardware cost is small. In fact, a 99% power reduction and a 52% reduction in transistor count are achieved compared to an SRAM-based LUT circuit. The authors would like to acknowledge ImPACT of CSTI, the CIES consortium program, JST-OPERA, and JSPS KAKENHI Grant No. 17H06093.
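The address-increment idea can be illustrated in software. The sketch below (an assumption-level behavioral model, not the MTJ circuit) keeps the stored bits in place and advances a head pointer, so each 1-bit shift costs exactly one write instead of one write per stored bit.

```python
# Simple model of the idea: instead of physically shifting every stored bit,
# keep the data in place and advance a head pointer, so each 1-bit shift
# performs exactly one data update.
class PointerShiftRegister:
    def __init__(self, depth):
        self.mem = [0] * depth   # nonvolatile storage cells
        self.head = 0            # address incremented on each shift

    def shift_in(self, bit):
        self.mem[self.head] = bit            # single data update per shift
        self.head = (self.head + 1) % len(self.mem)

    def read(self):
        n = len(self.mem)
        return [self.mem[(self.head + i) % n] for i in range(n)]  # oldest first

sr = PointerShiftRegister(4)
for b in (1, 0, 1, 1, 0):
    sr.shift_in(b)
print(sr.read())  # [0, 1, 1, 0]: one write per shift vs. 4 writes for a naive shift
```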
{"title":"Design of an MTJ-Based Nonvolatile LUT Circuit with a Data-Update Minimized Shift Operation for an Ultra-Low-Power FPGA: (Abstract Only)","authors":"D. Suzuki, T. Hanyu","doi":"10.1145/3174243.3174984","DOIUrl":"https://doi.org/10.1145/3174243.3174984","url":null,"abstract":"Nonvolatile FPGAs (NV-FPGAs) have a potential advantage to eliminate wasted standby power which is increasingly serious in recent standard SRAM-based FPGAs. However, functionality of the conventional NV-FPGAs are not sufficient compared to that of standard SRAM-based FPGAs. For example, an effective circuit structure to perform shift-register (SR) function has not been proposed yet. In this paper, a magnetic tunnel junction (MTJ) based nonvolatile lookup table (NV-LUT) circuit that can perform SR function with low power consumption is proposed. The MTJ device is the best candidate in terms of virtually unlimited endurance, CMOS compatibility, and 3D stacking capability. On the other hand, large power consumption to perform SR function a serious design issue for the MTJ-based NV-LUT circuit. Since the write current for the MTJ device is large and all the data must be updated after the SR operation using CMOS-oriented method, large power consumption is indispensable. To overcome this issue, the address for read/write access is incremented at each cycle instead of direct data shifting in the proposed LUT circuit. In this way, the number of data update per 1-bit shift is minimized to one, which results in great power saving. Moreover, since the selector is shared both read (logic) and write operation, its hardware cost is small. In fact, 99% of power reduction and 52% of transistor counts reduction compared to those of SRAM-based LUT circuit are performed. The authors would like to acknowledge ImPACT of CSTI, CIES consortium program, JST-OPERA, and JSPS KAKENHI Grant No. 17H06093.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133743775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Convolutional Neural Networks (CNNs) have gained great popularity. Intensive computation and a huge amount of external data access are two challenging factors for hardware acceleration. Beyond these, the ability to handle various CNN models is also a challenge. At present, most proposed FPGA-based CNN accelerators either can only handle specific CNN models or must be re-coded and re-downloaded to the FPGA for each different CNN model, which is a significant burden for developers. In this paper, we design a software-defined architecture that copes with different CNN models while keeping high throughput: the hardware can be programmed according to the requirements. Several techniques are proposed to optimize the performance of our accelerator. For the convolutional layers, we propose a software-defined data-reuse technique that ensures all parameters are loaded only once during the computing phase, which reduces the off-chip data access volume and, with it, the required memory capacity and bandwidth. By exploiting the sparsity of the input feature map, the loading of almost 80% of the weight parameters in the fully-connected (FC) layer can be skipped. Compared to previous works, our software-defined accelerator has the highest flexibility while keeping relatively high throughput. In addition, our accelerator has a lower off-chip data access volume, which has a large effect on power consumption.
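The FC-layer sparsity optimization can be sketched as follows. The snippet is only a software illustration under assumed shapes and sparsity; it shows why weights tied to zero-valued activations never need to be fetched.

```python
# Rough sketch of the sparsity idea (illustrative shapes and sparsity, assumed):
# in a fully-connected layer, weights tied to zero-valued input activations
# never contribute to the output, so their loads can be skipped entirely.
import numpy as np

def fc_skip_zero_inputs(weights, activations):
    """weights: (out, in) matrix; activations: (in,) vector, largely sparse."""
    nonzero = np.flatnonzero(activations)            # indices worth loading
    out = weights[:, nonzero] @ activations[nonzero] # only these columns fetched
    skipped = 1.0 - len(nonzero) / activations.size
    return out, skipped

rng = np.random.default_rng(0)
act = rng.random(1024) * (rng.random(1024) < 0.2)    # ~80% zeros after ReLU-like sparsity
w = rng.random((256, 1024))
out, skipped = fc_skip_zero_inputs(w, act)
print(f"skipped {skipped:.0%} of weight loads")      # roughly 80%
assert np.allclose(out, w @ act)                     # same result as the dense FC layer
```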
{"title":"Software-Defined FPGA-Based Accelerator for Deep Convolutional Neural Networks: (Abstract Only)","authors":"Yankang Du, Qinrang Liu, Shuai Wei, Chen Gao","doi":"10.1145/3174243.3174983","DOIUrl":"https://doi.org/10.1145/3174243.3174983","url":null,"abstract":"Now, Convolutional Neural Network (CNN) has gained great popularity. Intensive computation and huge external data access amount are two challenged factors for the hardware acceleration. Besides these, the ability to deal with various CNN models is also challenged. At present, most of the proposed FPGA-based CNN accelerator either can only deal with specific CNN models or should be re-coded and re-download on the FPGA for the different CNN models. This would bring great trouble for the developers. In this paper, we designed a software-defined architecture to cope with different CNN models while keeping high throughput. The hardware can be programmed according to the requirement. Several techniques are proposed to optimize the performance of our accelerators. For the convolutional layer, we proposed the software-defined data reuse technique to ensure that all the parameters can be only loaded once during the computing phase. This will reduce large off-chip data access amount and the need for the memory and the need for the memory bandwidth. By using the sparse property of the input feature map, almost 80% weight parameters can be skipped to be loaded in the full-connected (FC) layer. Compared to the previous works, our software-defined accelerator has the highest flexibility while keeping relative high throughout. Besides this, our accelerator also has lower off-chip data access amount which has a great effect on the power consumption.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115332601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper proposes CausaLearn, the first automated framework that enables real-time and scalable approximation of the Probability Density Function (PDF) in the context of causal Bayesian graphical models. CausaLearn targets complex streaming scenarios in which the input data evolves over time and independence cannot be assumed between data samples (e.g., continuous time-varying data analysis). Our framework is devised using a HW/SW co-design approach. We provide the first FPGA implementation of Hamiltonian Markov Chain Monte Carlo that can efficiently sample from the steady-state probability distribution at scale while considering the correlation between the observed data. CausaLearn is customizable to the limits of the underlying resource provisioning in order to maximize the effective system throughput. It uses physical profiling to abstract high-level hardware characteristics. These characteristics are integrated into our automated customization unit in order to tile, schedule, and batch the PDF-approximation workload according to the pertinent platform resources and constraints. We benchmark the design performance for analyzing various massive time-series data on three FPGA platforms with different computational budgets. Our extensive evaluations demonstrate up to two orders-of-magnitude runtime and energy improvements compared to the best-known prior solution. We provide an accompanying API that can be leveraged by data scientists and practitioners to automate and abstract hardware design optimization.
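For reference, the computation such a sampler performs per step is the standard Hamiltonian Monte Carlo update shown below; the step size, trajectory length, and toy target distribution are examples, not CausaLearn's configuration.

```python
# Textbook Hamiltonian Monte Carlo step in NumPy, as a reference for what the
# hardware pipeline computes (leapfrog integration plus a Metropolis test).
import numpy as np

def hmc_step(theta, grad_log_p, log_p, step=0.1, n_leapfrog=20, rng=np.random):
    p = rng.standard_normal(theta.shape)          # sample auxiliary momentum
    theta_new, p_new = theta.copy(), p.copy()
    p_new += 0.5 * step * grad_log_p(theta_new)   # half-step for momentum
    for _ in range(n_leapfrog):
        theta_new += step * p_new                 # full position update
        p_new += step * grad_log_p(theta_new)
    p_new -= 0.5 * step * grad_log_p(theta_new)   # undo the extra half-step
    h_old = -log_p(theta) + 0.5 * p @ p           # Hamiltonian before
    h_new = -log_p(theta_new) + 0.5 * p_new @ p_new
    return theta_new if rng.random() < np.exp(h_old - h_new) else theta

# Example: sample a 2-D standard normal
log_p = lambda x: -0.5 * x @ x
grad_log_p = lambda x: -x
theta = np.zeros(2)
for _ in range(1000):
    theta = hmc_step(theta, grad_log_p, log_p)
```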
{"title":"CausaLearn: Automated Framework for Scalable Streaming-based Causal Bayesian Learning using FPGAs","authors":"B. Rouhani, M. Ghasemzadeh, F. Koushanfar","doi":"10.1145/3174243.3174259","DOIUrl":"https://doi.org/10.1145/3174243.3174259","url":null,"abstract":"This paper proposes CausaLearn, the first automated framework that enables real-time and scalable approximation of Probability Density Function (PDF) in the context of causal Bayesian graphical models. CausaLearn targets complex streaming scenarios in which the input data evolves over time and independence cannot be assumed between data samples (e.g., continuous time-varying data analysis). Our framework is devised using a HW/SW co-design approach. We provide the first implementation of Hamiltonian Markov Chain Monte Carlo on FPGA that can efficiently sample from the steady state probability distribution at scales while considering the correlation between the observed data. CausaLearn is customizable to the limits of the underlying resource provisioning in order to maximize the effective system throughput. It uses physical profiling to abstract high-level hardware characteristics. These characteristics are integrated into our automated customization unit in order to tile, schedule, and batch the PDF approximation workload corresponding to the pertinent platform resources and constraints. We benchmark the design performance for analyzing various massive time-series data on three FPGA platforms with different computational budgets. Our extensive evaluations demonstrate up to two orders-of-magnitude runtime and energy improvements compared to the best-known prior solution. We provide an accompanying API that can be leveraged by data scientists and practitioners to automate and abstract hardware design optimization.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120869578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Andrea Guerrieri, Sahand Kashani-Akhavan, Mikhail Asiatici, P. Lombardi, B. Belhadj, P. Ienne
Modern heterogeneous SoCs (Systems-on-Chip) contain a set of Hard IPs (HIPs) surrounded by an FPGA fabric for hosting custom Hardware Accelerators (HAs). However, efficiently managing such HAs in an embedded Linux environment involves creating and building custom device drivers specific to the target platform, which negatively impacts development cost, portability, and time-to-market. To address this issue, we present LEOSoC, an open-source cross-platform embedded Linux library. LEOSoC reduces the development effort required to interface HAs with applications and makes SoCs easy to use for an embedded software developer who is familiar with the semantics of standard POSIX threads. Using LEOSoC does not require any specific version of the Linux kernel, nor rebuilding a custom driver for each new kernel release. LEOSoC consists of a base hardware system and a software layer. Both hardware and software are portable across SoCs from various vendors, and the library recognizes and auto-adapts to the target SoC platform on which it is running. Furthermore, LEOSoC allows the application to partially or completely change the structure of the HAs at runtime without rebooting the system, by leveraging the underlying platform's support for dynamic full/partial FPGA reconfigurability. The system has been tested on multiple COTS (Commercial Off-The-Shelf) boards from different vendors, each running a different version of Linux, thereby demonstrating the real portability and usability of LEOSoC in a specific industrial design.
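The pthread-like programming model described above can be caricatured in software. The sketch below is purely hypothetical (the names are not LEOSoC's real API); it only illustrates what "launch an accelerator job like a thread, then join it" means to the application developer.

```python
# Hypothetical illustration only (names and calls are NOT LEOSoC's real API):
# a hardware-accelerator invocation exposed with thread-like start/join
# semantics, modeled here in pure software.
import threading

class AcceleratorJob:
    """Stand-in for an accelerator invocation that behaves like a thread."""
    def __init__(self, kernel, *args):
        self._result = None
        self._thread = threading.Thread(
            target=lambda: setattr(self, "_result", kernel(*args)))

    def start(self):            # analogous to pthread_create
        self._thread.start()

    def join(self):             # analogous to pthread_join
        self._thread.join()
        return self._result

job = AcceleratorJob(sum, range(1_000_000))   # offload a "kernel"
job.start()                                   # host code keeps running here
print(job.join())                             # 499999500000
```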
{"title":"LEOSoC: An Open-Source Cross-Platform Embedded Linux Library for Managing Hardware Accelerators in Heterogeneous System-on-Chips(Abstract Only)","authors":"Andrea Guerrieri, Sahand Kashani-Akhavan, Mikhail Asiatici, P. Lombardi, B. Belhadj, P. Ienne","doi":"10.1145/3174243.3175002","DOIUrl":"https://doi.org/10.1145/3174243.3175002","url":null,"abstract":"Modern heterogeneous SoCs (System-on-Chip) contain a set of Hard IPs (HIPs) surrounded by an FPGA fabric for hosting custom Hardware Accelerators (HAs). However, efficiently managing such HAs in an embedded Linux environment involves creating and building custom device drivers specific to the target platform, which negatively impacts development cost, portability and time-to-market. To address this issue, we present LEOSoC, an open-source cross-platform embedded Linux library. LEOSoC reduces the development effort required to interface HAs with applications and makes SoCs easy to use for an embedded software developer who is familiar with the semantics of standard POSIX threads. Using LEOSoC does not require any specific version of the Linux kernel, nor to rebuild a custom driver for each new kernel release. LEOSoC consists of a base hardware system and a software layer. Both hardware and software are portable across SoC from various vendors and the library recognizes and auto-adapts to the target SoC platform on which it is running. Furthermore, LEOSoC allows the application to partially or completely change the structure of the HAs at runtime without rebooting the system by leveraging the underlying platforms? support for dynamic full/partial FPGA reconfigurability. The system has been tested on multiple COTS (Commercial Off The Shelf) boards from different vendors, each one running different versions of Linux and, therefore, proving the real portability and usability of LEOSoC in a specific industrial design.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"470 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115870422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Skyline computation is a method for extracting interesting entries from a large population with multiple attributes. These entries, called skyline or Pareto-optimal entries, are known to have extreme characteristics that cannot be found using outlier-detection methods. Skyline computation is an important task for characterizing large amounts of data and selecting interesting entries with extreme features. When the population changes dynamically, the task of calculating a sequence of skyline sets is called continuous skyline computation. This task is known to be difficult for the following reasons: (1) information must be kept for non-skyline entries, since they may join the skyline in the future; (2) the appearance or disappearance of even a single entry can change the skyline drastically; and (3) it is difficult to adopt a geometric acceleration algorithm for skyline computation tasks with high-dimensional datasets. A new algorithm, called jointed rooted-tree (JR-tree), has been developed that manages entries using a rooted-tree structure. JR-tree delays extending the tree to deeper levels in order to accelerate tree construction and traversal. In this study, we propose a JR-tree-based acceleration algorithm for continuous skyline computation. Our hardware algorithm parallelizes the calculation of the dominance relations between a target entry and the skyline entries. We implemented the algorithm on an FPGA and showed that high-speed tree construction and traversal can be realized. Compared with an Intel CPU running state-of-the-art software algorithms, our FPGA-based implementation reduces the query processing time for synthetic and real-world datasets, running 1.7x to 35x faster than the software implementations.
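As a reference for the operation being parallelized, the following plain-Python snippet defines the dominance relation and uses it to filter a skyline; maximization on every attribute is assumed here, though the direction is application-specific.

```python
# Plain-Python reference for the dominance relation the accelerator parallelizes
# (maximization on every attribute is assumed).
def dominates(a, b):
    """a dominates b if a is >= b in every attribute and > in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(entries):
    result = []
    for e in entries:
        if not any(dominates(s, e) for s in result):         # e survives
            result = [s for s in result if not dominates(e, s)] + [e]
    return result

points = [(9, 1), (5, 5), (3, 8), (4, 4), (7, 1)]
print(skyline(points))  # [(9, 1), (5, 5), (3, 8)] -- (4, 4) and (7, 1) are dominated
```

The hardware evaluates many of these pairwise dominance checks concurrently, which is where the FPGA speedup comes from.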
{"title":"Continuous Skyline Computation Accelerator with Parallelizing Dominance Relation Calculations: (Abstract Only)","authors":"Kenichi Koizumi, K. Hiraki, M. Inaba","doi":"10.1145/3174243.3174961","DOIUrl":"https://doi.org/10.1145/3174243.3174961","url":null,"abstract":"Skyline Computation is a method for extracting interesting entries from a large population with multiple attributes. These entries, called skyline or Pareto optimal entries, are known to have extreme characteristics that cannot be found by using outlier detection methods. Skyline computation is an important task for characterizing large amounts of data and selecting interesting entries with extreme features. When the population changes dynamically, the task of calculating a sequence of skyline sets is called a continuous skyline computation. This task is known to be difficult for the following reasons: (1) information must be kept for non-skyline entries, since they may join the skyline in the future; (2) the appearance or disappearance of even a single entry can change the skyline drastically; and (3) it is difficult to adopt a geometric acceleration algorithm for skyline computation tasks with high-dimensional datasets. A new algorithm, called jointed rooted-tree (JR-tree), has been developed that manages entries using a rooted-tree structure. JR-tree delays extend the tree to deeper levels to accelerate tree construction and traversal. In this study, we propose the JR-tree based continuous skyline computation acceleration algorithm. Our hardware algorithm parallelizes the calculations of dominance relation between a target entry and the skyline entries. We implemented our hardware algorithm on an FPGA and showed that high-speed tree construction and traversal can be realized. Comparing our FPGA-based implementation with an Intel CPU running state-of-the-art software algorithms, it was found to reduce the query processing time for synthetic and real-world datasets. Our hardware implementation is 1.7x to 35x faster than the software implementations.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"385 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114899454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 2: CAD","authors":"Sabyasachi Das","doi":"10.1145/3252937","DOIUrl":"https://doi.org/10.1145/3252937","url":null,"abstract":"","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"16 11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122380805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jakub Cabal, Pavel Benácek, Lukás Kekely, Michal Kekely, V. Pus, J. Korenek
As the throughput of computer networks is on a constant rise, there is a need for ever-faster packet parsing modules at all points of the networking infrastructure. Parsing is a crucial operation that influences the final throughput of a network device. Moreover, this operation must precede any kind of further traffic processing such as filtering/classification, deep packet inspection, and so on. This paper presents a parser architecture that can currently scale up to terabit throughput in a single FPGA, while the overall processing speed is sustained even for the shortest frame lengths and for an arbitrary number of supported protocols. The architecture of our parser can also be automatically generated from a high-level description of a protocol stack in the P4 language, which makes the rapid deployment of new protocols considerably easier. The results presented in the paper confirm that our automatically generated parsers are capable of reaching an effective throughput of over 1 Tbps (or more than 2000 Mpps) on Xilinx UltraScale+ FPGAs and around 800 Gbps (or more than 1200 Mpps) on the previous-generation Virtex-7 FPGAs.
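To show what such a generated parser computes, the toy Python model below walks a minimal Ethernet -> IPv4 -> TCP/UDP parse graph; the protocol subset and packet layout are simplified assumptions, not the generated hardware.

```python
# Toy software model of the parse graph a P4 description defines (Ethernet ->
# IPv4 -> TCP/UDP only; field offsets follow the standard headers, everything
# else is simplified).
import struct

def parse_packet(frame: bytes):
    headers = {}
    ethertype = struct.unpack_from("!H", frame, 12)[0]
    headers["ethernet"] = {"ethertype": ethertype}
    if ethertype != 0x0800:                       # not IPv4: stop parsing
        return headers
    ihl = (frame[14] & 0x0F) * 4                  # IPv4 header length in bytes
    proto = frame[14 + 9]
    headers["ipv4"] = {"protocol": proto, "ihl": ihl}
    l4 = 14 + ihl
    if proto in (6, 17):                          # TCP or UDP
        sport, dport = struct.unpack_from("!HH", frame, l4)
        headers["tcp" if proto == 6 else "udp"] = {"sport": sport, "dport": dport}
    return headers

# 14-byte Ethernet + minimal 20-byte IPv4 (proto=UDP) + 8-byte UDP header
frame = bytes(12) + b"\x08\x00" + b"\x45" + bytes(8) + b"\x11" + bytes(10) \
        + struct.pack("!HH", 53, 4444) + bytes(4)
print(parse_packet(frame))
```

A hardware parser generated from P4 unrolls this decision graph into a pipeline so that one frame (or more) can be resolved every clock cycle regardless of frame length.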
{"title":"Configurable FPGA Packet Parser for Terabit Networks with Guaranteed Wire-Speed Throughput","authors":"Jakub Cabal, Pavel Benácek, Lukás Kekely, Michal Kekely, V. Pus, J. Korenek","doi":"10.1145/3174243.3174250","DOIUrl":"https://doi.org/10.1145/3174243.3174250","url":null,"abstract":"As throughput of computer networks is on a constant rise, there is a need for ever-faster packet parsing modules at all points of the networking infrastructure. Parsing is a crucial operation which has an influence on the final throughput of a network device. Moreover, this operation must precede any kind of further traffic processing like filtering/classification, deep packet inspection, and so on. This paper presents a parser architecture which is capable to currently scale up to a terabit throughput in a single FPGA, while the overall processing speed is sustained even on the shortest frame lengths and for an arbitrary number of supported protocols. The architecture of our parser can be also automatically generated from a high-level description of a protocol stack in the P4 language which makes the rapid deployment of new protocols considerably easier. The results presented in the paper confirm that our automatically generated parsers are capable of reaching an effective throughput of over 1 Tbps (or more than 2000 Mpps) on the Xilinx UltraScale+ FPGAs and around 800 Gbps (or more than 1200 Mpps) on their previous generation Virtex-7 FPGAs.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"113 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115020717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}