Wenqiang Wang, Kaiyuan Guo, Mengyuan Gu, Yuchun Ma, Yu Wang
{"title":"A universal FPGA-based floating-point matrix processor for mobile systems","authors":"Wenqiang Wang, Kaiyuan Guo, Mengyuan Gu, Yuchun Ma, Yu Wang","doi":"10.1109/FPT.2014.7082766","DOIUrl":null,"url":null,"abstract":"FPGA-based acceleration of matrix operations is a promising solution in mobile systems. However, most related work focuses on a certain operation instead of a complete system. In this paper, we explore the possibility of integrating multiple matrix accelerators with a master processor and propose a universal floating-point matrix processor. The processor supports multiple matrix-matrix operations (Level 3 BLAS) and the matrix size is unlimited. The key component of the processor is a shared matrix cache which enables on-chip communication between different accelerators. This structure reduces the external memory bandwidth requirement and improves the overall performance. Considering the performance of the whole system, an asynchronous instruction execution mechanism is further proposed in the hardware-software interface so as to reduce the workload of the master processor. We demonstrate the system using a DE3 develop board and achieve a computing performance of about 19 GFLOPS. Experiments show the proposed processor achieves higher performance and energy efficiency than some state-of-the-art embedded processors including ARM cortex A9 and NIOS Il/f soft-core processor. The performance of the processor is even comparable to some desktop processors.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"51 1","pages":"139-146"},"PeriodicalIF":0.0000,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 International Conference on Field-Programmable Technology (FPT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FPT.2014.7082766","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6
Abstract
FPGA-based acceleration of matrix operations is a promising solution in mobile systems. However, most related work focuses on a certain operation instead of a complete system. In this paper, we explore the possibility of integrating multiple matrix accelerators with a master processor and propose a universal floating-point matrix processor. The processor supports multiple matrix-matrix operations (Level 3 BLAS) and the matrix size is unlimited. The key component of the processor is a shared matrix cache which enables on-chip communication between different accelerators. This structure reduces the external memory bandwidth requirement and improves the overall performance. Considering the performance of the whole system, an asynchronous instruction execution mechanism is further proposed in the hardware-software interface so as to reduce the workload of the master processor. We demonstrate the system using a DE3 develop board and achieve a computing performance of about 19 GFLOPS. Experiments show the proposed processor achieves higher performance and energy efficiency than some state-of-the-art embedded processors including ARM cortex A9 and NIOS Il/f soft-core processor. The performance of the processor is even comparable to some desktop processors.