Hardware-Software Co-Design for Real-Time Latency-Accuracy Navigation in tinyML Applications
Payman Behnam, Jianming Tong, Alind Khare, Yangyu Chen, Yue Pan, Pranav Gadikar, Abhimanyu Bambhaniya, Tushar Krishna, Alexey Tumanov
IEEE Micro, November 2023. DOI: 10.1109/mm.2023.3317243

tinyML applications increasingly operate in dynamically changing deployment scenarios that require optimizing for both accuracy and latency. Existing methods mainly target a single point in the accuracy/latency tradeoff space, which is insufficient because no single static point can be optimal under variable conditions. We draw on a recently proposed weight-shared SuperNet mechanism to serve a stream of queries that activate different SubNets within a SuperNet. This creates an opportunity to exploit the inherent temporal locality of queries that share the same SuperNet. We propose a hardware-software co-design called SUSHI that introduces a novel SubGraph Stationary optimization. SUSHI consists of a novel FPGA implementation and a software scheduler that controls which SubNets to serve and which SubGraph to cache in real time. SUSHI yields up to a 32% improvement in latency, a 0.98% increase in served accuracy, and up to 78.7% savings in off-chip energy across several neural network architectures.
A 10.7-µJ/frame 88% Accuracy CIFAR-10 Single-Chip Neuromorphic FPGA Processor Featuring Various Nonlinear Functions of Dendrites in Human Cerebrum
Atsutake Kosuge, Yao-Chung Hsu, Rei Sumikawa, Mototsugu Hamada, Tadahiro Kuroda, Tomoe Ishikawa
IEEE Micro, November 2023. DOI: 10.1109/mm.2023.3315676

A neuromorphic architecture is suitable for low-power tinyML processors. However, the large number of synapses in recent deep neural networks requires multichip implementations, resulting in large power consumption due to chip-to-chip interfaces. Here, we present a 10.7-µJ/frame single-chip neuromorphic FPGA processor. To reduce the required hardware resources, we developed two techniques. The first is a dendrite-inspired nonlinear neural network (dNNN) that mimics the various nonlinear functions of dendritic spines in the human cerebrum. The second is a line-scan-based architecture that reduces the total amount of hardware resources. A 14-layer convolutional neural network, which achieves 88% accuracy on the CIFAR-10 dataset, was implemented on a single FPGA board. Compared to a state-of-the-art spiking CNN-based neuromorphic FPGA processor, the energy efficiency of the proposed architecture is improved by a factor of 94.4 while achieving 6% better classification accuracy.
Fifty Years of the International Symposium on Computer Architecture: A Data-Driven Retrospective
Matthew D. Sinclair, Parthasarathy Ranganathan, Gaurang Upasani, Adrian Sampson, David Patterson, Rutwik Jain, Nidhi Parthasarathy, Shaan Shah
IEEE Micro, November 2023. DOI: 10.1109/mm.2023.3324465

2023 marked the fiftieth year of the International Symposium on Computer Architecture (ISCA). As one of the oldest and most preeminent computer architecture conferences, ISCA represents a microcosm of the broader community; correspondingly, a 50-year retrospective offers a great way to track the impact and evolution of the field. Analyzing the content and impact of all the papers published at ISCA so far, we show how computer architecture research has been at the forefront of advances that have driven the broader computing ecosystem. Decadal trends show a dynamic and rapidly evolving field with diverse contributions. Examining how the most highly cited papers achieve their popularity reveals interesting trends in technology adoption curves and the path to impact. Our data also highlight a growing and thriving community, with interesting insights on diversity and scale. We conclude with a summary of the celebratory panel held at ISCA, with observations on the exciting future ahead.
{"title":"Computing in Science & Engineering","authors":"","doi":"10.1109/mm.2023.3324749","DOIUrl":"https://doi.org/10.1109/mm.2023.3324749","url":null,"abstract":"","PeriodicalId":13100,"journal":{"name":"IEEE Micro","volume":"41 7","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135567033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analysis of Historical Patenting Behavior and Patent Characteristics of Computer Architecture Companies—Part VII: Relationship Between Prosecution Time and Claims
Joshua J. Yi
IEEE Micro, November 2023. DOI: 10.1109/mm.2023.3320318

A previous article in this series showed that the correlation between prosecution time and the number of claims was relatively low. This article analyzes that correlation further by examining the effect that patent class has on it.
{"title":"IEEE Computer Society Volunteer Service Awards","authors":"","doi":"10.1109/mm.2023.3327852","DOIUrl":"https://doi.org/10.1109/mm.2023.3327852","url":null,"abstract":"","PeriodicalId":13100,"journal":{"name":"IEEE Micro","volume":"135 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135565328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Making Machine Learning More Energy Efficient by Bringing It Closer to the Sensor
Marius Brehler, Lucas Camphausen, Benjamin Heidebroek, Dennis Krön, Henri Gründer, Simon Camphausen
IEEE Micro, November 2023. DOI: 10.1109/mm.2023.3316348

Processing data close to the sensor on a low-cost, low-power embedded device has the potential to unlock new areas for machine learning (ML). Whether such ML applications can be deployed depends on the energy efficiency of the solution. One way to lower energy consumption is to bring the application as close as possible to the sensor. We demonstrate the concept of transforming an ML application running near the sensor into a hybrid near-sensor/in-sensor application. This approach aims to reduce overall energy consumption, and we showcase it with a motion-classification example, which can be considered a simpler subproblem of activity recognition. The reduction in energy consumption is achieved by combining a convolutional neural network with a decision tree. Both applications are compared in terms of accuracy and energy consumption, illustrating the benefits of the hybrid approach.
{"title":"IEEE Computer Society Career Center","authors":"","doi":"10.1109/mm.2023.3322209","DOIUrl":"https://doi.org/10.1109/mm.2023.3322209","url":null,"abstract":"","PeriodicalId":13100,"journal":{"name":"IEEE Micro","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135565324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring Memory-Oriented Design Optimization of Edge-AI Hardware for Extended Reality Applications
Vivek Parmar, Syed Shakib Sarwar, Ziyun Li, Hsien-Hsin S. Lee, Barbara De Salvo, Manan Suri
IEEE Micro, November 2023. DOI: 10.1109/mm.2023.3321249

Low-power edge-AI capabilities are essential for on-device extended reality (XR) applications that support the vision of the Metaverse. In this work, we investigate two representative XR workloads, (i) hand detection and (ii) eye segmentation, for hardware design-space exploration. For both applications, we train deep neural networks and analyze the impact of quantization and hardware-specific bottlenecks. Through simulations, we evaluate a CPU and two systolic inference-accelerator implementations. Next, we compare these hardware solutions at advanced technology nodes. We then evaluate the impact of integrating state-of-the-art emerging non-volatile memory (NVM) technology (STT/SOT/VGSOT MRAM) into the XR-AI inference pipeline. We find that significant energy benefits (≥24%) can be achieved for hand detection (IPS = 10) and eye segmentation (IPS = 0.1) by introducing NVM into the memory hierarchy for designs at the 7-nm node while meeting the minimum IPS (inferences per second) targets. Moreover, the small form factor of MRAM enables a substantial reduction in area (≥30%).
Special Issue on TinyML
Vijay Janapa Reddi, Boris Murmann
IEEE Micro, November 2023. DOI: 10.1109/mm.2023.3322048

This IEEE Micro special issue on tiny machine learning (TinyML) explores cutting-edge research on optimizing machine learning models for highly resource-constrained devices like microcontrollers and embedded systems. The articles cover techniques across the full TinyML stack, including efficient neural network design, on-device learning, model compression, hardware–software co-design, and specialized applications. These selected works showcase techniques to enable increasingly sophisticated intelligence on low-power, memory-constrained edge devices. They provide valuable insights to overcome challenges in deploying performant yet compact TinyML solutions that can perceive, reason, and interact intelligently, even at the very edge.