{"title":"Hamamu:通过添加硬矩阵乘法器块,专门为ML应用提供fpga","authors":"Aman Arora, Zhigang Wei, L. John","doi":"10.1109/ASAP49362.2020.00018","DOIUrl":null,"url":null,"abstract":"Designing efficient hardware for accelerating artificial intelligence (AI) and machine learning (ML) applications is a major challenge. Rapidly changing algorithms and neural network architectures make FPGA based designs an attractive solution. But the generic building blocks available in current FPGAs (Logic Blocks (LBs), multipliers, DSP blocks) limit the acceleration that can be achieved. We propose Hamamu, a modification to the current FPGA architecture that makes FPGAs specialized for ML applications. Specifically, we propose adding hard matrix multiplier blocks (matmuls) into the FPGA fabric. These matmuls are implemented using systolic arrays of MACs (Multiply-And-Accumulate) and can be connected using programmable direct interconnect between neighboring matmuls to make larger systolic matrix multipliers. We explore various matmul sizes ($2\\times 2\\times 2$, $4\\times 4\\times 4$, $8\\times 8\\times 8$, $16\\times 16\\times 16$) and various strategies to place these blocks on the FPGA (Columnar, Surround, Hybrid). We find that providing $4\\times 4\\times 4$ hard matrix multiplier blocks in an FPGA speeds up neural networks from MLPerf benchmarks by up to $\\sim 3.9x$, compared to a Stratix-10 like FPGA with equal number of MACs, same MAC architecture and high DSP:LB ratio. Although the flexibility of the FPGA will reduce for non-ML applications, an FPGA with hard matrix multipliers is a faster, and more area efficient hardware accelerator for ML applications, compared to current FPGAs.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":"{\"title\":\"Hamamu: Specializing FPGAs for ML Applications by Adding Hard Matrix Multiplier Blocks\",\"authors\":\"Aman Arora, Zhigang Wei, L. John\",\"doi\":\"10.1109/ASAP49362.2020.00018\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Designing efficient hardware for accelerating artificial intelligence (AI) and machine learning (ML) applications is a major challenge. Rapidly changing algorithms and neural network architectures make FPGA based designs an attractive solution. But the generic building blocks available in current FPGAs (Logic Blocks (LBs), multipliers, DSP blocks) limit the acceleration that can be achieved. We propose Hamamu, a modification to the current FPGA architecture that makes FPGAs specialized for ML applications. Specifically, we propose adding hard matrix multiplier blocks (matmuls) into the FPGA fabric. These matmuls are implemented using systolic arrays of MACs (Multiply-And-Accumulate) and can be connected using programmable direct interconnect between neighboring matmuls to make larger systolic matrix multipliers. We explore various matmul sizes ($2\\\\times 2\\\\times 2$, $4\\\\times 4\\\\times 4$, $8\\\\times 8\\\\times 8$, $16\\\\times 16\\\\times 16$) and various strategies to place these blocks on the FPGA (Columnar, Surround, Hybrid). 
We find that providing $4\\\\times 4\\\\times 4$ hard matrix multiplier blocks in an FPGA speeds up neural networks from MLPerf benchmarks by up to $\\\\sim 3.9x$, compared to a Stratix-10 like FPGA with equal number of MACs, same MAC architecture and high DSP:LB ratio. Although the flexibility of the FPGA will reduce for non-ML applications, an FPGA with hard matrix multipliers is a faster, and more area efficient hardware accelerator for ML applications, compared to current FPGAs.\",\"PeriodicalId\":375691,\"journal\":{\"name\":\"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)\",\"volume\":\"73 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"15\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ASAP49362.2020.00018\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASAP49362.2020.00018","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Hamamu: Specializing FPGAs for ML Applications by Adding Hard Matrix Multiplier Blocks
Designing efficient hardware for accelerating artificial intelligence (AI) and machine learning (ML) applications is a major challenge. Rapidly changing algorithms and neural network architectures make FPGA-based designs an attractive solution, but the generic building blocks available in current FPGAs (Logic Blocks (LBs), multipliers, DSP blocks) limit the acceleration that can be achieved. We propose Hamamu, a modification to the current FPGA architecture that specializes FPGAs for ML applications. Specifically, we propose adding hard matrix multiplier blocks (matmuls) to the FPGA fabric. These matmuls are implemented as systolic arrays of MACs (multiply-and-accumulate units) and can be connected through programmable direct interconnect between neighboring matmuls to form larger systolic matrix multipliers. We explore various matmul sizes ($2\times 2\times 2$, $4\times 4\times 4$, $8\times 8\times 8$, $16\times 16\times 16$) and various strategies for placing these blocks on the FPGA (Columnar, Surround, Hybrid). We find that providing $4\times 4\times 4$ hard matrix multiplier blocks in an FPGA speeds up neural networks from MLPerf benchmarks by up to $\sim 3.9\times$, compared to a Stratix-10-like FPGA with an equal number of MACs, the same MAC architecture, and a high DSP:LB ratio. Although hard matrix multipliers reduce the FPGA's flexibility for non-ML applications, an FPGA equipped with them is a faster and more area-efficient hardware accelerator for ML applications than current FPGAs.
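The two mechanisms the abstract relies on (a systolic array of MACs inside each hard block, and composition of neighboring blocks into larger multipliers) can be sketched in software. The following Python model is illustrative only, not the authors' implementation: the function names `systolic_matmul` and `composed_matmul` and the output-stationary dataflow are assumptions made for this sketch. `systolic_matmul` emulates a grid in which PE(i, j) accumulates C[i, j] with one MAC per cycle as skewed operands flow past it, and `composed_matmul` mimics chaining $4\times 4\times 4$ blocks into an $8\times 8\times 8$ multiplier by accumulating tile-level partial products.

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cycle-level sketch of an output-stationary systolic array.

    Operands arrive skewed: at cycle t, the pair (A[i, k], B[k, j])
    with k = t - i - j reaches PE(i, j), which performs a single
    multiply-and-accumulate (MAC) into its resident C[i, j].
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for t in range(M + N + K - 2):              # wavefront cycles
        for i in range(M):
            for j in range(N):
                k = t - i - j                   # operand index at PE(i, j)
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]  # one MAC this cycle
    return C

def composed_matmul(A: np.ndarray, B: np.ndarray, block: int = 4) -> np.ndarray:
    """Software analogue of composing neighboring hard matmul blocks:
    a larger product built from block x block x block multipliers,
    with partial sums accumulated across the inner-dimension tiles."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, block):
        for j in range(0, N, block):
            for k in range(0, K, block):
                C[i:i + block, j:j + block] += systolic_matmul(
                    A[i:i + block, k:k + block],
                    B[k:k + block, j:j + block])
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.integers(-8, 8, size=(8, 8)).astype(np.int64)
    B = rng.integers(-8, 8, size=(8, 8)).astype(np.int64)
    # An 8x8x8 product assembled from 4x4x4 systolic blocks matches
    # a direct matrix product.
    assert np.array_equal(composed_matmul(A, B), A @ B)
    print("8x8x8 product from 4x4x4 systolic blocks: OK")
```

In hardware, the per-tile accumulation in `composed_matmul` corresponds to the programmable direct interconnect described in the abstract: partial sums pass between neighboring matmul blocks without traversing the FPGA's general-purpose routing.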