FLASH: FPGA-Accelerated Smart Switches with GCN Case Study

Pouya Haghi, William Krska, Cheng Tan, Tong Geng, P. Chen, Connor Greenwood, Anqi Guo, Thomas M. Hines, Chunshu Wu, Ang Li, A. Skjellum, Martin C. Herbordt
{"title":"FLASH: fpga加速的GCN智能交换机案例研究","authors":"Pouya Haghi, William Krska, Cheng Tan, Tong Geng, P. Chen, Connor Greenwood, Anqi Guo, Thomas M. Hines, Chunshu Wu, Ang Li, A. Skjellum, Martin C. Herbordt","doi":"10.1145/3577193.3593739","DOIUrl":null,"url":null,"abstract":"Some communication switches, e.g., the Mellanox SHArP and those in the IBM BlueGene clusters, are augmented to process packets at the application level with fixed-function collectives. This approach, however, lacks flexibility, which limits their applicability in diverse and dynamic workloads. Recently, a new type of programmable packet processor, which uses high-level languages, e.g., P4, has emerged as a possible candidate. P4-based switches, however, fall short in certain applications, including machine learning, where capabilities not currently supported by P4 are needed. These include more complex calculation, such as sparse computation and fused multiply-accumulate, data-intensive floating point operations, data reuse, and significant memory. The problem addressed here is that such a switch augmentation needs to support: a large amount of state, significant flexible compute capability, and ease of programming, all while maintaining full functionality, including ensuring high throughput, and demonstrating utility. In this work, we propose a programmable look-aside-type accelerator that can be embedded into, or attached to, existing communication switch pipelines and that is capable of processing packets at line rate. The proposed in-switch accelerator is based on mixing an ISA (subset of RISC-V instructions) with dataflow graphs (found in CGRAs). To augment performance, vector instructions are also supported. To facilitate usability, we have developed a complete toolchain to compile user-provided C/C++ codes to appropriate back-end instructions for configuring the accelerator. While this approach is flexible enough to support various workloads, in this paper, we consider Graph Convolutional Networks (GCNs) as a case study. Experimental results show that this approach considerably improves the performance of distributed GCN applications.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"FLASH: FPGA-Accelerated Smart Switches with GCN Case Study\",\"authors\":\"Pouya Haghi, William Krska, Cheng Tan, Tong Geng, P. Chen, Connor Greenwood, Anqi Guo, Thomas M. Hines, Chunshu Wu, Ang Li, A. Skjellum, Martin C. Herbordt\",\"doi\":\"10.1145/3577193.3593739\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Some communication switches, e.g., the Mellanox SHArP and those in the IBM BlueGene clusters, are augmented to process packets at the application level with fixed-function collectives. This approach, however, lacks flexibility, which limits their applicability in diverse and dynamic workloads. Recently, a new type of programmable packet processor, which uses high-level languages, e.g., P4, has emerged as a possible candidate. P4-based switches, however, fall short in certain applications, including machine learning, where capabilities not currently supported by P4 are needed. 
These include more complex calculation, such as sparse computation and fused multiply-accumulate, data-intensive floating point operations, data reuse, and significant memory. The problem addressed here is that such a switch augmentation needs to support: a large amount of state, significant flexible compute capability, and ease of programming, all while maintaining full functionality, including ensuring high throughput, and demonstrating utility. In this work, we propose a programmable look-aside-type accelerator that can be embedded into, or attached to, existing communication switch pipelines and that is capable of processing packets at line rate. The proposed in-switch accelerator is based on mixing an ISA (subset of RISC-V instructions) with dataflow graphs (found in CGRAs). To augment performance, vector instructions are also supported. To facilitate usability, we have developed a complete toolchain to compile user-provided C/C++ codes to appropriate back-end instructions for configuring the accelerator. While this approach is flexible enough to support various workloads, in this paper, we consider Graph Convolutional Networks (GCNs) as a case study. Experimental results show that this approach considerably improves the performance of distributed GCN applications.\",\"PeriodicalId\":424155,\"journal\":{\"name\":\"Proceedings of the 37th International Conference on Supercomputing\",\"volume\":\"2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-06-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 37th International Conference on Supercomputing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3577193.3593739\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 37th International Conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3577193.3593739","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3

Abstract

Some communication switches, e.g., the Mellanox SHArP and those in the IBM BlueGene clusters, are augmented to process packets at the application level with fixed-function collectives. This approach, however, lacks flexibility, which limits their applicability in diverse and dynamic workloads. Recently, a new type of programmable packet processor, which uses high-level languages, e.g., P4, has emerged as a possible candidate. P4-based switches, however, fall short in certain applications, including machine learning, where capabilities not currently supported by P4 are needed. These include more complex calculations, such as sparse computation and fused multiply-accumulate, data-intensive floating point operations, data reuse, and significant memory. The problem addressed here is that such a switch augmentation needs to support: a large amount of state, significant flexible compute capability, and ease of programming, all while maintaining full functionality, including ensuring high throughput, and demonstrating utility. In this work, we propose a programmable look-aside-type accelerator that can be embedded into, or attached to, existing communication switch pipelines and that is capable of processing packets at line rate. The proposed in-switch accelerator is based on mixing an ISA (subset of RISC-V instructions) with dataflow graphs (found in CGRAs). To augment performance, vector instructions are also supported. To facilitate usability, we have developed a complete toolchain to compile user-provided C/C++ codes to appropriate back-end instructions for configuring the accelerator. While this approach is flexible enough to support various workloads, in this paper, we consider Graph Convolutional Networks (GCNs) as a case study. Experimental results show that this approach considerably improves the performance of distributed GCN applications.
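The abstract names sparse computation and fused multiply-accumulate as the core operations the in-switch accelerator must support, with GCN inference as the case study. The sketch below is a minimal, hypothetical C kernel for the neighbor-aggregation step of a GCN layer over a CSR-format adjacency matrix; it illustrates the kind of user-provided C/C++ code the paper's toolchain is said to compile, but the function name, signature, and data layout are assumptions for illustration, not the authors' actual input.

    #include <stddef.h>

    /* Hypothetical GCN aggregation kernel (illustrative only): for each vertex,
     * accumulate the features of its neighbors weighted by normalized edge
     * values. The adjacency matrix is assumed to be stored in CSR format; this
     * is the sparse, fused multiply-accumulate loop pattern the abstract
     * describes, not the authors' actual toolchain input. */
    void gcn_aggregate(size_t num_vertices, size_t feat_dim,
                       const size_t *row_ptr,   /* CSR row offsets           */
                       const size_t *col_idx,   /* CSR column indices        */
                       const float  *edge_val,  /* normalized edge weights   */
                       const float  *features,  /* [num_vertices][feat_dim]  */
                       float        *out)       /* [num_vertices][feat_dim]  */
    {
        for (size_t v = 0; v < num_vertices; ++v) {
            /* clear the output row for this vertex */
            for (size_t f = 0; f < feat_dim; ++f)
                out[v * feat_dim + f] = 0.0f;

            /* walk this vertex's neighbors in the sparse adjacency structure */
            for (size_t e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
                size_t u = col_idx[e];
                float  w = edge_val[e];
                for (size_t f = 0; f < feat_dim; ++f)
                    /* multiply-accumulate over neighbor features */
                    out[v * feat_dim + f] += w * features[u * feat_dim + f];
            }
        }
    }

The inner feature loop is regular and maps naturally onto the vector instructions the accelerator supports, while the irregular neighbor loop is the sparse part; in a distributed setting, this is the style of aggregation that an in-switch accelerator could apply to partial results as packets pass through the switch rather than at the endpoints.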