Nitish Srivastava, Hongbo Rong, Prithayan Barua, Guanyu Feng, Huanqi Cao, Zhiru Zhang, D. Albonesi, Vivek Sarkar, Wenguang Chen, Paul Petersen, Geoff N. Lowney, A. Herr, C. Hughes, T. Mattson, P. Dubey
{"title":"T2S-Tensor: Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations","authors":"Nitish Srivastava, Hongbo Rong, Prithayan Barua, Guanyu Feng, Huanqi Cao, Zhiru Zhang, D. Albonesi, Vivek Sarkar, Wenguang Chen, Paul Petersen, Geoff N. Lowney, A. Herr, C. Hughes, T. Mattson, P. Dubey","doi":"10.1109/FCCM.2019.00033","DOIUrl":null,"url":null,"abstract":"We present a language and compilation framework for productively generating high-performance systolic arrays for dense tensor kernels on spatial architectures, including FPGAs and CGRAs. It decouples a functional specification from a spatial mapping, allowing programmers to quickly explore various spatial optimizations for the same function. The actual implementation of these optimizations is left to a compiler. Thus, productivity and performance are achieved at the same time. We used this framework to implement several important dense tensor kernels. We implemented dense matrix multiply for an Arria-10 FPGA and a research CGRA, achieving 88% and 92% of the performance of manually written, and highly optimized expert (ninja\") implementations in just 3% of their engineering time. Three other tensor kernels, including MTTKRP, TTM and TTMc, were also implemented with high performance and low design effort, and for the first time on spatial architectures.\"","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"39","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FCCM.2019.00033","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 39
Abstract
We present a language and compilation framework for productively generating high-performance systolic arrays for dense tensor kernels on spatial architectures, including FPGAs and CGRAs. It decouples a functional specification from a spatial mapping, allowing programmers to quickly explore various spatial optimizations for the same function. The actual implementation of these optimizations is left to a compiler. Thus, productivity and performance are achieved at the same time. We used this framework to implement several important dense tensor kernels. We implemented dense matrix multiply for an Arria-10 FPGA and a research CGRA, achieving 88% and 92% of the performance of manually written, and highly optimized expert (ninja") implementations in just 3% of their engineering time. Three other tensor kernels, including MTTKRP, TTM and TTMc, were also implemented with high performance and low design effort, and for the first time on spatial architectures."