Enhancing DNN Training Efficiency Via Dynamic Asymmetric Architecture
Pub Date: 2023-03-12 | DOI: 10.1109/LCA.2023.3275909
Samer Kurzum;Gil Shomron;Freddy Gabbay;Uri Weiser
Deep neural networks (DNNs) require abundant multiply-and-accumulate (MAC) operations. Thanks to DNNs’ ability to accommodate noise, some of the computational burden is commonly mitigated by quantization, that is, by using lower-precision floating-point operations. Quantization at layer granularity is the preferred method, as it is easily mapped to commodity hardware. In this paper, we propose Dynamic Asymmetric Architecture (DAA), in which the micro-architecture decides at runtime what precision each MAC operation should use. We demonstrate a DAA with two data streams and a value-based controller that decides which data stream deserves the higher-precision resource. We evaluate this mechanism in terms of accuracy on a number of convolutional neural networks (CNNs) and demonstrate its feasibility on top of a systolic array. Our experimental analysis shows that DAA potentially achieves a 2x throughput improvement for ResNet-18 while saving 35% of the energy with less than 0.5% degradation in accuracy.
{"title":"Enhancing DNN Training Efficiency Via Dynamic Asymmetric Architecture","authors":"Samer Kurzum;Gil Shomron;Freddy Gabbay;Uri Weiser","doi":"10.1109/LCA.2023.3275909","DOIUrl":"10.1109/LCA.2023.3275909","url":null,"abstract":"Deep neural networks (DNNs) require abundant multiply-and-accumulate (MAC) operations. Thanks to DNNs’ ability to accommodate noise, some of the computational burden is commonly mitigated by quantization–that is, by using lower precision floating-point operations. Layer granularity is the preferred method, as it is easily mapped to commodity hardware. In this paper, we propose Dynamic Asymmetric Architecture (DAA), in which the micro-architecture decides what the precision of each MAC operation should be during runtime. We demonstrate a DAA with two data streams and a value-based controller that decides which data stream deserves the higher precision resource. We evaluate this mechanism in terms of accuracy on a number of convolutional neural networks (CNNs) and demonstrate its feasibility on top of a systolic array. Our experimental analysis shows that DAA potentially achieves 2x throughput improvement for ResNet-18 while saving 35% of the energy with less than 0.5% degradation in accuracy.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 1","pages":"49-52"},"PeriodicalIF":2.3,"publicationDate":"2023-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45178350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-03-10 | DOI: 10.1109/LCA.2023.3274931
Pengzhou He;Yazheng Tu;Çetin Kaya Koç;Jiafeng Xie
Large integer polynomial multiplication is frequently used as a key component in post-quantum cryptography (PQC) algorithms. In line with the growing emphasis on efficient hardware implementations of PQC, in this letter we propose a new lightweight hardware accelerator for the large integer polynomial multiplication of Saber (one of the National Institute of Standards and Technology third-round finalists). First, we provide a derivation process to obtain the algorithm for the targeted polynomial multiplication. Then, the proposed algorithm is mapped onto an optimized hardware accelerator. Finally, we demonstrate the efficiency of the proposed design, e.g., this accelerator with $v=32$
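For readers unfamiliar with the workload, the following Python sketch shows the operation such an accelerator targets at a functional level: schoolbook polynomial multiplication in the ring Z_q[x]/(x^n + 1), assuming Saber's usual parameters n = 256 and q = 2^13. It is only a reference model; the letter's derived algorithm, the accelerator's datapath, and the role of the parameter $v$ are not reproduced here.

```python
# Functional reference only (assumed Saber parameters n = 256, q = 2**13);
# this is plain schoolbook multiplication in Z_q[x]/(x^n + 1), not the
# letter's optimized algorithm or its hardware mapping.
SABER_N = 256
SABER_Q = 1 << 13  # 8192

def poly_mul_negacyclic(a, b, n=SABER_N, q=SABER_Q):
    """Multiply two length-n coefficient vectors modulo (x^n + 1, q)."""
    assert len(a) == n and len(b) == n
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % q
            else:
                c[k - n] = (c[k - n] - a[i] * b[j]) % q  # x^n = -1 in this ring
    return c
```

A hardware accelerator would normally restructure these O(n^2) loops, for example into a matrix-vector or systolic schedule that processes several coefficients per cycle, rather than execute them sequentially as above.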