Session VI Panel Discussions
2023 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)
Pub Date: 2023-04-19 | DOI: 10.1109/coolchips57690.2023.10122101
COOL-NPU: Complementary Online Learning Neural Processing Unit with CNN-SNN Heterogeneous Core and Event-driven Backpropagation
Sangyeob Kim, Soyeon Kim, Seongyon Hong, Sangjin Kim, Donghyeon Han, Jiwon Choi, H. Yoo
2023 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)
Pub Date: 2023-04-19 | DOI: 10.1109/COOLCHIPS57690.2023.10121940
This paper presents a low-power NPU, the COmplementary Online Learning Neural Processing Unit (COOL-NPU), with three key features: 1) low-power forward-gradient generation logic with a global counter and local gradient units, 2) a skip-index generator and sparsity-aware CNN core for neuron-level backpropagation, and 3) an SNN core with a distributed L1 cache that eliminates redundant SRAM accesses. By exploiting the complementary characteristics of CNNs and SNNs, the design achieves a 47.7% energy reduction over the previous state-of-the-art online learning processor.
Dual Vector Load for Improved Pipelining in Vector Processors
Viktor Razilov, Juncen Zhong, E. Matús, G. Fettweis
2023 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)
Pub Date: 2023-04-19 | DOI: 10.1109/COOLCHIPS57690.2023.10121996
Vector processors execute instructions that manipulate vectors of data items using time-division multiplexing (TDM). Chaining, the pipelined execution of dependent vector instructions, is essential for high performance and utilization. When two vectors are loaded sequentially as the inputs of a subsequent compute instruction, as is often the case in vector applications, chaining cannot take effect for the entire duration of the first vector load. To close this gap, we propose dual load: a parallel or interleaved load of the two input vectors. We study this feature analytically and derive necessary conditions for performance improvement. Our investigation finds that compute-bound and some memory-bound applications profit from this feature when the memory and compute bandwidths are sufficiently high. A speedup of up to 33% is possible in the ideal case. Our practical implementation shows improvements of up to 21% with a hardware overhead of less than 2%.