
Latest publications from Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Session details: Session 8: Applications 2
Lesley Shannon
DOI: 10.1145/3252943 | Published: 2018-02-15
Citations: 0
Memory-Efficient Fast Fourier Transform on Streaming Data by Fusing Permutations
F. Serre, Markus Püschel
We propose a novel FFT datapath that reduces the memory requirement compared to state-of-the-art RAM-based implementations by up to a factor of two. The novelty is in a technique to fuse the datapaths for the required perfect shuffle and bit reversal and is applicable to an entire design space of FFT implementations with varying degrees of reuse and number of input ports. We implemented a tool to generate this FFT design space for a given input size and to benchmark against prior work. The results show a reduction of half the RAM banks and/or half the logic complexity used for the permutations. The technique for fusing permutations is more generally applicable beyond the FFT.
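The two permutations the paper fuses, the perfect shuffle between butterfly stages and the final bit reversal, are easy to state on index vectors. A minimal Python sketch of the permutations themselves (illustrative only; the paper's contribution is a fused streaming RAM datapath, not this software form):

```python
def bit_reversal(xs):
    """Permute a length-2^k list: the element at index i moves to the
    index whose k-bit binary representation is reversed."""
    n = len(xs)
    k = n.bit_length() - 1  # log2(n)
    out = [None] * n
    for i, x in enumerate(xs):
        r = int(format(i, f"0{k}b")[::-1], 2)  # reverse the k-bit index
        out[r] = x
    return out

def perfect_shuffle(xs):
    """Riffle the two halves: [a0, a1, ..., b0, b1, ...] -> [a0, b0, a1, b1, ...]."""
    half = len(xs) // 2
    out = []
    for i in range(half):
        out.extend([xs[i], xs[half + i]])
    return out
```

For n = 8, `bit_reversal` maps indices 0..7 to [0, 4, 2, 6, 1, 5, 3, 7]. A streaming Pease-style radix-2 FFT applies perfect-shuffle reorderings between butterfly stages and a bit reversal at the end, which is why a memory-efficient implementation benefits from fusing the two.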
DOI: 10.1145/3174243.3174263 | Published: 2018-02-15
Citations: 2
A HOG-based Real-time and Multi-scale Pedestrian Detector Demonstration System on FPGA
Jan Dürre, Dario Paradzik, H. Blume
Pedestrian detection will play a major role in future driver assistance and autonomous driving. One powerful algorithm in this field uses HOG features to describe the specific properties of pedestrians in images. To determine their locations, features are extracted and classified window-wise from different scales of an input image. The classification results are finally merged to remove overlapping detections. Real-time execution of this method requires dedicated FPGA or ASIC architectures. Recent work has focused on accelerating feature extraction and classification. Although merging is an important step of the algorithm, it is rarely considered in hardware implementations. One reason may be its complexity and irregularity, which are not trivial to implement in hardware. In this paper, we present a new bottom-up FPGA architecture that maps the full HOG-based pedestrian detection algorithm, including feature extraction, SVM classification, and multi-scale processing in combination with merging. For that purpose, we also propose a new hardware-optimized merging method. The resulting architecture is highly efficient. Additionally, we present a fully real-time, multi-scale, FPGA-based pedestrian detection demonstration system.
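The merging step can be approximated in software by greedy non-maximum suppression over the boxes collected from all pyramid scales. A hedged Python sketch (this is standard NMS, not the paper's hardware-optimized merging method):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def merge_detections(dets, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it.

    dets: list of (score, (x, y, w, h)) gathered from all image scales.
    """
    kept = []
    for score, box in sorted(dets, reverse=True):
        if all(iou(box, kb) < thresh for _, kb in kept):
            kept.append((score, box))
    return kept
```

The irregularity the abstract mentions is visible even here: the inner loop compares each candidate against a data-dependent, growing set of survivors, which is awkward to pipeline in hardware.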
DOI: 10.1145/3174243.3174249 | Published: 2018-02-15
Citations: 19
K-Flow: A Programming and Scheduling Framework to Optimize Dataflow Execution on CPU-FPGA Platforms: (Abstract Only)
J. Cong, Zhenman Fang, Yao Hu, Di Wu
With the slowing down of Moore's law, major cloud service providers such as Amazon Web Services, Microsoft Azure, and Alibaba Cloud have all started deploying FPGAs in their cloud platforms to improve performance and energy efficiency. From the perspective of performance per unit cost in the cloud, it is essential to efficiently utilize all available CPU and FPGA resources within a requested computing instance. However, most prior studies overlook CPU-FPGA co-optimization or require considerable manual effort to achieve it. In this poster, we present a framework called K-Flow, which enables easy FPGA accelerator integration and efficient CPU-FPGA co-scheduling for big data applications. K-Flow abstracts an application as a widely used directed acyclic graph (DAG), and dynamically schedules a number of CPU threads and/or FPGA accelerator processing elements (PEs) to execute the dataflow tasks on each DAG node. Moreover, K-Flow provides user-friendly interfaces to program each DAG node and automates the tedious process of FPGA accelerator integration and CPU-FPGA co-optimization, using the genomic read alignment tool BWA-MEM as a case study. Experimental results show that K-Flow achieves a throughput that is on average 94.5% of the theoretical upper bound and 1.4x better than a straightforward FPGA integration.
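The DAG-driven dispatch that such a framework performs can be sketched as a topological-order scheduler that routes each ready task to the device it is mapped to. A simplified Python sketch (the node names and the cpu/fpga tagging are illustrative assumptions, not K-Flow's actual API):

```python
from collections import deque

def schedule_dag(nodes, edges):
    """Dispatch DAG nodes in dependency order to CPU or FPGA workers.

    nodes: {name: "cpu" | "fpga"}, the device each task is mapped to.
    edges: list of (src, dst) dependencies.
    Returns the launch order with the device chosen for each node.
    """
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for s, d in edges:
        succ[s].append(d)
        indeg[d] += 1
    ready = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append((n, nodes[n]))  # a real runtime would enqueue the task here
        for d in succ[n]:
            indeg[d] -= 1
            if indeg[d] == 0:
                ready.append(d)
    return order
```

A real co-scheduler additionally overlaps CPU and FPGA work and balances load across PEs; this sketch only shows the dependency-ordered dispatch skeleton.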
DOI: 10.1145/3174243.3174968 | Published: 2018-02-15
Citations: 1
Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform
Jialiang Zhang, J. Li
Graph traversal is a core primitive for graph analytics and a basis for many higher-level graph analysis methods. However, irregularities in the structure of scale-free graphs (e.g., social networks) limit our ability to analyze these important and growing datasets. A key challenge is the redundant graph computation caused by the presence of high-degree vertices, which not only increases the total amount of computation but also incurs unnecessary random data access. In this paper, we present a graph processing system on an FPGA-HMC platform, based on software/hardware co-design and co-optimization. For the first time, we leverage an inherent graph property, vertex degree, to co-optimize the algorithm and hardware architecture. In particular, we first develop two algorithm optimization techniques: degree-aware adjacency list reordering and degree-aware vertex index sorting. The former reduces the number of redundant graph computations, while the latter creates a strong correlation between vertex index and data access frequency, which can effectively guide the hardware design. We further implement the optimized hybrid graph traversal algorithm on an FPGA-HMC platform. By leveraging the strong correlation between vertex index and data access frequency created by degree-aware vertex index sorting, we develop two platform-dependent hardware optimization techniques, namely degree-aware data placement and degree-aware adjacency list compression. Together, these two techniques substantially reduce the amount of access to external memory. Finally, we conduct extensive experiments on an FPGA-HMC platform to verify the effectiveness of the proposed techniques. To the best of our knowledge, our implementation achieves the highest performance (45.8 billion traversed edges per second) among existing FPGA-based graph processing systems.
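Degree-aware vertex index sorting can be sketched in a few lines: relabel vertices in descending order of degree, so that a small index implies a frequently accessed vertex and hot vertices can be placed in fast on-chip memory. An illustrative Python sketch (not the authors' FPGA implementation):

```python
def degree_aware_sort(adj):
    """Relabel vertices so the highest-degree vertices get the smallest IDs.

    adj: {vertex: [neighbors]}. After relabeling, vertex index correlates
    with expected access frequency, which can guide data placement.
    """
    by_degree = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    new_id = {v: i for i, v in enumerate(by_degree)}
    return {new_id[v]: sorted(new_id[u] for u in adj[v]) for v in adj}
```

After this transformation, a hardware design can use a simple index threshold to decide which vertices live in on-chip buffers and which stay in external memory.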
DOI: 10.1145/3174243.3174245 | Published: 2018-02-15
Citations: 38
FastTrack: Exploiting Fast FPGA Wiring for Implementing NoC Shortcuts (Abstract Only)
Nachiket Kapre, T. Krishna
The latency of packet-switched FPGA overlay Networks-on-Chip (NoCs) grows linearly with the NoC dimensions, since packets typically spend a cycle in each dynamic router along the path. High-performance FPGA NoCs have to pipeline their interconnect aggressively, adding extra latency overhead to the NoC. The use of FPGA-friendly deflection routing schemes further exacerbates latency. Fortunately, FPGAs provide segmented interconnect with different lengths (speeds). Faster FPGA tracks can be used to reduce the number of switchbox hops along the packet path. We introduce FastTrack, an adaptation of the NoC organization that inserts express bypass links in the NoC to skip multiple router stages in a single clock cycle. Our FastTrack design can be tuned to support different express link lengths for performance, and depopulation strategies for controlling cost. For the Xilinx Virtex-7 485T FPGA, an 8×8 FastTrack NoC is 2× larger than a base Hoplite NoC, but operates at 0.8-1.2× its clock frequency when using express links of length 2-4. FastTrack delivers throughput and latency improvements across a range of statistical workloads (2-2.5×), and traces extracted from FPGA accelerator case studies such as Sparse Matrix-Vector Multiplication (2.5×), Graph Analytics (2.8×), and Multi-processor overlay applications (2×). FastTrack also shows energy efficiency improvements of up to 2× over baseline Hoplite, due to the higher sustained rates and high-speed operation of express links made possible by fast FPGA interconnect.
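The benefit of express links is easy to quantify for dimension-ordered routing: a link of length L covers L routers in a single cycle. A small illustrative sketch (idealized hop counting on a mesh, ignoring deflections and congestion):

```python
def hops(distance, express_len=1):
    """Router hops to cover `distance` in one mesh dimension when express
    links skip `express_len` routers per cycle (1 = baseline NoC)."""
    fast, slow = divmod(distance, express_len)
    return fast + slow

def route_hops(src, dst, express_len=1):
    """Dimension-ordered (X then Y) hop count with optional express links."""
    dx = abs(dst[0] - src[0])
    dy = abs(dst[1] - src[1])
    return hops(dx, express_len) + hops(dy, express_len)
```

On an 8×8 mesh, a corner-to-corner packet takes 14 hops at baseline but only 8 with express links of length 4, which is the kind of latency reduction the express bypass links target.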
DOI: 10.1145/3174243.3174962 | Published: 2018-02-15
Citations: 0
Rosetta: A Realistic High-Level Synthesis Benchmark Suite for Software Programmable FPGAs
Yuan Zhou, Udit Gupta, Steve Dai, Ritchie Zhao, Nitish Kumar Srivastava, Hanchen Jin, Joseph Featherston, Yi-Hsiang Lai, Gai Liu, Gustavo Angarita Velasquez, Wenping Wang, Zhiru Zhang
Modern high-level synthesis (HLS) tools greatly reduce the turnaround time of designing and implementing complex FPGA-based accelerators. They also expose various optimization opportunities which cannot be easily explored at the register-transfer level. With the increasing adoption of the HLS design methodology and continued advances in synthesis optimization, there is a growing need for realistic benchmarks to (1) facilitate comparisons between tools, (2) evaluate and stress-test new synthesis techniques, and (3) establish meaningful performance baselines to track progress of the HLS technology. While several HLS benchmark suites already exist, they are primarily composed of small textbook-style function kernels rather than complete and complex applications. To address this limitation, we introduce Rosetta, a realistic benchmark suite for software programmable FPGAs. Designs in Rosetta are fully developed applications. They are associated with realistic performance constraints, and optimized with advanced features of modern HLS tools. We believe that Rosetta is not only useful for the HLS research community, but can also serve as a set of design tutorials for non-expert HLS users. In this paper we describe the characteristics of our benchmarks and the optimization techniques applied to them. We further report experimental results on an embedded FPGA device as well as a cloud FPGA platform.
DOI: 10.1145/3174243.3174255 | Published: 2018-02-15
Citations: 89
DATuner: An Extensible Distributed Autotuning Framework for FPGA Design and Design Automation: (Abstract Only)
Gai Liu, Ecenur Ustun, Shaojie Xiang, Chang Xu, Guojie Luo, Zhiru Zhang
Mainstream FPGA tools contain an extensive set of user-controlled compilation options and internal optimization strategies that significantly impact design quality. These compilation and optimization parameters create a complex design space that human designers may not be able to explore effectively in a time-efficient manner. In this work we describe DATuner, an open-source, extensible, distributed autotuning framework for optimizing FPGA designs and design automation tools using an ensemble of search techniques managed by multi-armed bandit algorithms. DATuner is designed for a distributed environment that uses parallel searches to amortize the significant runtime overhead of the CAD tools. DATuner provides a convenient interface for plugging in user-supplied tools, which enables end users to apply DATuner to the design tools and flows of their interest. We demonstrate the effectiveness and extensibility of DATuner using three case studies: clock frequency optimization for FPGA compilation, fixed-point optimization, and autotuning of logic synthesis transformations.
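The bandit layer can be sketched with the classic UCB1 rule: each candidate search technique is an arm, and the tuner picks the arm with the best upper confidence bound on observed reward (e.g., quality-of-result improvement per trial). An illustrative sketch under those assumptions, not DATuner's actual implementation:

```python
import math

def ucb_choose(counts, rewards, t):
    """UCB1: pick the search technique ('arm') balancing exploitation of
    the best average reward against exploration of rarely tried arms.

    counts: {arm: times tried}, rewards: {arm: cumulative reward}, t: total trials.
    """
    best, best_score = None, float("-inf")
    for arm in counts:
        if counts[arm] == 0:
            return arm  # try each technique at least once
        score = rewards[arm] / counts[arm] + math.sqrt(2 * math.log(t) / counts[arm])
        if score > best_score:
            best, best_score = arm, score
    return best
```

In a distributed tuner, each worker would repeatedly call such a rule, run the chosen search technique for one step, and fold the resulting QoR gain back into `counts` and `rewards`.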
DOI: 10.1145/3174243.3174978 | Published: 2018-02-15
Citations: 0
FASTCF: FPGA-based Accelerator for STochastic-Gradient-Descent-based Collaborative Filtering
Shijie Zhou, R. Kannan, Yu Min, V. Prasanna
Sparse matrix factorization using Stochastic Gradient Descent (SGD) is a popular technique for deriving latent features from observations. SGD is widely used for Collaborative Filtering (CF), itself a well-known machine learning technique for recommender systems. In this paper, we develop an FPGA-based accelerator, FASTCF, to accelerate the SGD-based CF algorithm. FASTCF consists of parallel, pipelined processing units which concurrently process distinct user ratings by accessing a shared on-chip buffer. We design FASTCF through a holistic analysis of the specific design challenges of accelerating SGD-based CF on FPGA. Based on our analysis of these design challenges, we develop a bipartite graph processing approach with a novel 3-level hierarchical partitioning scheme that enables conflict-minimizing scheduling and processing of on-chip feature vector data, significantly accelerating the processing of this bipartite graph. First, we develop a fast heuristic to partition the input graph into induced subgraphs; this enables FASTCF to efficiently buffer vertex data for reuse and completely hide communication overhead. Second, we partition all the edges of each subgraph into matchings to extract the maximum parallelism. Third, we schedule the execution of the edges inside each matching to reduce concurrent memory access conflicts to the shared on-chip buffer. Compared with non-optimized baseline designs, the hierarchical partitioning approach results in up to 60x data dependency reduction, 4.2x bank conflict reduction, and 15.4x speedup. We implement FASTCF on a state-of-the-art FPGA and evaluate its performance using three large real-life datasets. Experimental results show that FASTCF sustains a high throughput of up to 217 GFLOPS (billion floating-point operations per second). Compared with state-of-the-art multi-core and GPU implementations, FASTCF demonstrates 13.3x and 12.7x speedup, respectively.
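The workload FASTCF accelerates is standard SGD matrix factorization for CF: each observed rating pulls the corresponding user and item feature vectors toward each other. A minimal software sketch of that update (NumPy-based, with illustrative names and a toy dataset — not the paper's FPGA datapath or its partitioning scheme):

```python
import numpy as np

def sgd_epoch(P, Q, ratings, lr=0.05, reg=0.05):
    """One SGD pass over observed ratings (u, i, r).

    P: (num_users, k) user feature matrix; Q: (num_items, k) item feature matrix.
    """
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]                  # prediction error for this rating
        p_u = P[u].copy()                      # snapshot before the coupled update
        P[u] += lr * (err * Q[i] - reg * P[u]) # move user features toward item features
        Q[i] += lr * (err * p_u - reg * Q[i])  # and vice versa, with L2 regularization

# toy problem: 2 users, 2 items, k = 2 latent features
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(2, 2))
Q = rng.normal(scale=0.1, size=(2, 2))
ratings = [(0, 0, 5.0), (1, 1, 3.0)]
for _ in range(500):
    sgd_epoch(P, Q, ratings)
# after training, each prediction P[u] @ Q[i] approaches its observed rating r
```

Note the data hazard the paper's partitioning attacks: two ratings sharing a user or item touch the same feature vector, so their updates cannot safely run in parallel — which is why FASTCF groups edges into matchings (edge sets sharing no vertex) before scheduling them onto the pipelined units.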
DOI: 10.1145/3174243.3174252 · Published 2018-02-15 · Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
Citations: 9
A FPGA Friendly Approximate Computing Framework with Hybrid Neural Networks: (Abstract Only)
Haiyue Song, Xiang Song, Tianjian Li, Hao Dong, Naifeng Jing, Xiaoyao Liang, Li Jiang
Neural approximate computing promises energy efficiency at the cost of a tolerable loss in output quality. The architecture contains two neural networks: an approximate accelerator that generates approximate results, and a classifier that determines whether an input can be safely approximated. However, such designs do not map well onto a heterogeneous computing platform, due to the large communication overhead between the approximate accelerator and the accurate cores, and the large speed gap between them. This paper proposes a software-hardware co-design strategy. Through a deep exploration of data distributions in the feature space, we first propose a novel approximate computing architecture containing a multi-class classifier and multiple approximate accelerators; this architecture, derived from existing iterative co-training methods, can shift more data from accurate computation (on the CPU) to the approximate accelerators (on the FPGA); the increased invocation of the approximate accelerators yields higher utilization of the FPGA-based accelerator and thus enhanced performance. Moreover, much less input data is redistributed by the classifier (also on the FPGA) back to the CPU, which minimizes CPU-FPGA communication. Second, we design a pipelined datapath with batched input/output for the proposed hybrid architecture to efficiently hide the communication latency. A mask technique is proposed to decouple the synchronization between CPU and FPGA, in order to minimize the frequency of communication.
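The classifier-gated datapath described above can be sketched in a few lines: a cheap predicate decides, per input, whether the fast approximate path is safe, or whether the input must fall back to accurate computation. The functions and threshold below are illustrative stand-ins, not the paper's neural networks:

```python
def accurate(x):
    """Exact (expensive) computation -- a stand-in cubic, for illustration."""
    return x * x * x - 2 * x

def approximate(x):
    """Cheap surrogate: a linear fit that is only accurate near x = 0."""
    return -2 * x

def is_safe(x, threshold=0.5):
    """Classifier stand-in: approximate only where the surrogate's error is small."""
    return abs(x) < threshold

def run_batch(batch):
    """Route each input; count accelerator invocations (the utilization metric)."""
    results, approx_hits = [], 0
    for x in batch:
        if is_safe(x):
            results.append(approximate(x))
            approx_hits += 1
        else:
            results.append(accurate(x))
    return results, approx_hits

results, hits = run_batch([0.1, 0.3, 2.0, -1.5])
# inputs near zero take the approximate path; the rest fall back to accurate
```

The co-design question the paper addresses is where each box runs: the classifier and approximate path on the FPGA, the accurate path on the CPU — so every input the classifier sends back to `accurate` costs a CPU-FPGA round trip, which is why shifting more inputs into the safe region and batching the I/O pays off.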
DOI: 10.1145/3174243.3174965 · Published 2018-02-15 · Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
Citations: 2