Exploring energy efficiency of LSTM accelerators: A parameterized architecture design for embedded FPGAs
Authors: Chao Qian, Tianheng Ling, Gregor Schiele
Journal of Systems Architecture, vol. 152, Article 103181 (published 2024-05-18)
DOI: 10.1016/j.sysarc.2024.103181
URL: https://www.sciencedirect.com/science/article/pii/S1383762124001188
Impact factor: 3.7; JCR: Q1 (Computer Science, Hardware & Architecture)
Long Short-Term Memory Networks (LSTMs) are pivotal in on-device time series analysis for embedded systems, particularly for managing sensor data streams. Yet, their deployment on resource-constrained embedded devices presents notable challenges. In response, we introduce a novel parameterized architecture for LSTM accelerators designed explicitly for embedded Field-Programmable Gate Arrays (FPGAs). Our approach involves strategic design choices, such as employing computationally efficient activation functions and optimizing clock frequency with a pipelined Arithmetic Logic Unit (ALU). These decisions drive our architecture towards enhanced energy efficiency while maintaining adaptability across diverse application scenarios. A key feature of our architecture is its configurable parameters, which allow for tailored optimization through the optional use of Digital Signal Processor Slices for ALUs and the selective implementation of activation functions. Our empirical evaluations conducted on the Spartan-7 XC7S15 FPGA demonstrate the robustness of our methodology, achieving a 2.33× improvement in energy efficiency over previous solutions. Furthermore, our study examines the correlation between memory resource types and energy efficiency across various LSTM model sizes. Impressively, even with a 9× increase in the hidden size of the LSTM cell, our accelerator maintains an energy efficiency of 10.03 GOP/s/W, with only a minor decrease of 14.65%. However, it is critical to note that our current design is not yet optimized for larger FPGA models such as the Spartan-7 XC7S25 and XC7S50. For these models, timing constraints, rather than resource limitations, pose challenges to scaling, highlighting a potential area for future optimization.
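As a rough illustration of the scaling figures in the abstract, the sketch below back-computes the energy efficiency of the smaller model from the two reported numbers. It assumes that the 14.65% decrease is relative to the smaller model's efficiency; the abstract does not state this explicitly, and the function name is hypothetical.

```python
def baseline_efficiency(reported: float, relative_decrease: float) -> float:
    """Recover the pre-decrease value, assuming
    reported = baseline * (1 - relative_decrease)."""
    if not 0.0 <= relative_decrease < 1.0:
        raise ValueError("relative_decrease must be in [0, 1)")
    return reported / (1.0 - relative_decrease)


# Figures taken from the abstract.
REPORTED_GOPS_PER_W = 10.03   # efficiency at the 9x larger hidden size
RELATIVE_DECREASE = 0.1465    # the "minor decrease of 14.65%"

# Assumption: the decrease is measured against the smaller model.
baseline = baseline_efficiency(REPORTED_GOPS_PER_W, RELATIVE_DECREASE)
print(f"Implied baseline efficiency: {baseline:.2f} GOP/s/W")
```

Under that assumption, the smaller configuration would sit at roughly 11.75 GOP/s/W. This is an inference from the abstract, not a figure the authors report there.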
About the journal:
The Journal of Systems Architecture: Embedded Software Design (JSA) is a journal covering all design and architectural aspects related to embedded systems and software. It ranges from the microarchitecture level via the system software level up to the application-specific architecture level. Aspects such as real-time systems, operating systems, FPGA programming, programming languages, communications (limited to analysis and the software stack), mobile systems, parallel and distributed architectures as well as additional subjects in the computer and system architecture area will fall within the scope of this journal. Technology will not be a main focus, but its use and relevance to particular designs will be. Case studies are welcome but must contribute more than just a design for a particular piece of software.
Design automation of such systems, including methodologies, techniques and tools for their design, as well as novel designs of software components fall within the scope of this journal. Novel applications that use embedded systems are also central to this journal. While hardware is not a part of this journal, hardware/software co-design methods that consider the interplay between software and hardware components with an emphasis on software are also relevant here.