Yen-Ting Chen, Han-Xiang Liu, Yuan-Hao Chang, Yu-Pei Liang, W. Shih
Intermittent systems are usually energy-harvesting embedded systems that harvest energy from ambient environment and perform computation intermittently. Due to the unreliable power, these intermittent systems typically adopt different checkpointing strategies for ensuring the data consistency and execution progress after the systems are resumed from unpredictable power failures. Existing checkpointing strategies are usually suitable for bare-metal intermittent systems with short run time. Due to the improvement of energy-harvesting techniques, intermittent systems are having longer run time and better computation power, so that more and more intermittent systems tend to function with a microkernel for handling more/multiple tasks at the same time. However, existing checkpointing strategies were not designed for (or aware of) such microkernel-based intermittent systems that support the running of multiple tasks, and thus have poor performance on preserving the execution progress. To tackle this issue, we propose a design, called self-adaptive checkpointing strategy (SACS), tailored for microkernel-based intermittent systems. By leveraging the time-slicing scheduler, the proposed design dynamically adjust the checkpointing interval at both run time and reboot time, so as to improve the system performance by achieving a good balance between the execution progress and the number of performed checkpoints. A series of experiments was conducted based on a development board of Texas Instrument (TI) with well-known benchmarks. Compared to the state-of-the-art designs, experiment results show that our design could reduce the execution time by at least 46.8% under different conditions of ambient environment while maintaining the number of performed checkpoints in an acceptable scale.
{"title":"SACS: A Self-Adaptive Checkpointing Strategy for Microkernel-Based Intermittent Systems","authors":"Yen-Ting Chen, Han-Xiang Liu, Yuan-Hao Chang, Yu-Pei Liang, W. Shih","doi":"10.1145/3531437.3539705","DOIUrl":"https://doi.org/10.1145/3531437.3539705","url":null,"abstract":"Intermittent systems are usually energy-harvesting embedded systems that harvest energy from ambient environment and perform computation intermittently. Due to the unreliable power, these intermittent systems typically adopt different checkpointing strategies for ensuring the data consistency and execution progress after the systems are resumed from unpredictable power failures. Existing checkpointing strategies are usually suitable for bare-metal intermittent systems with short run time. Due to the improvement of energy-harvesting techniques, intermittent systems are having longer run time and better computation power, so that more and more intermittent systems tend to function with a microkernel for handling more/multiple tasks at the same time. However, existing checkpointing strategies were not designed for (or aware of) such microkernel-based intermittent systems that support the running of multiple tasks, and thus have poor performance on preserving the execution progress. To tackle this issue, we propose a design, called self-adaptive checkpointing strategy (SACS), tailored for microkernel-based intermittent systems. By leveraging the time-slicing scheduler, the proposed design dynamically adjust the checkpointing interval at both run time and reboot time, so as to improve the system performance by achieving a good balance between the execution progress and the number of performed checkpoints. A series of experiments was conducted based on a development board of Texas Instrument (TI) with well-known benchmarks. Compared to the state-of-the-art designs, experiment results show that our design could reduce the execution time by at least 46.8% under different conditions of ambient environment while maintaining the number of performed checkpoints in an acceptable scale.","PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132162150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Emerging Non-volatile memories are considered as potential candidates for replacing traditional DRAM in main memory. However, downsides like long write latency, high write energy, and low write endurance make their direct adoption in the memory hierarchy challenging. Approaches that reduce the number of bits written are beneficial to overcome such drawbacks. In this direction, we propose a compression technique that reduces overall bits written to the NVM, thus improving its lifetime. The proposed method, SIBR, compresses the incoming blocks to PCM by either eliminating the words to be written or by reducing the number of bits written for each word. For the former, words that have either zero content or are identical to consecutive words are not written. The latter is done by computing the difference of each word with a base word and storing only the difference (or delta) instead of the full word. The novelty of our contribution is to update the base word at run-time, thus achieving better compression. It is shown that computing the delta with a dynamically decided base compared to a fixed base gives smaller delta values. The dynamic base is another word in the same block. SIBR outperforms two state-of-the-art compression techniques by achieving a fairly low compression ratio and high coverage. Experimental results show a substantial reduction in bit-flips and improvement in lifetime.
{"title":"Exploiting successive identical words and differences with dynamic bases for effective compression in Non-Volatile Memories","authors":"Swati Upadhyay, Arijit Nath, H. Kapoor","doi":"10.1145/3531437.3539716","DOIUrl":"https://doi.org/10.1145/3531437.3539716","url":null,"abstract":"Emerging Non-volatile memories are considered as potential candidates for replacing traditional DRAM in main memory. However, downsides like long write latency, high write energy, and low write endurance make their direct adoption in the memory hierarchy challenging. Approaches that reduce the number of bits written are beneficial to overcome such drawbacks. In this direction, we propose a compression technique that reduces overall bits written to the NVM, thus improving its lifetime. The proposed method, SIBR, compresses the incoming blocks to PCM by either eliminating the words to be written or by reducing the number of bits written for each word. For the former, words that have either zero content or are identical to consecutive words are not written. The latter is done by computing the difference of each word with a base word and storing only the difference (or delta) instead of the full word. The novelty of our contribution is to update the base word at run-time, thus achieving better compression. It is shown that computing the delta with a dynamically decided base compared to a fixed base gives smaller delta values. The dynamic base is another word in the same block. SIBR outperforms two state-of-the-art compression techniques by achieving a fairly low compression ratio and high coverage. Experimental results show a substantial reduction in bit-flips and improvement in lifetime.","PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122362965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Suwan Kim, Sehyeon Chung, Taewhan Kim, Heechun Park
Monolithic 3D (M3D) is a revolutionary technology for high-density and high-performance chip design in the post-Moore era. However, it suffers from considerable thermal confinement due to the transistor stacking and insulating materials between the layers. As a way of reducing power, thereby mitigating the thermal problem, we propose a comprehensive physical design methodology that incorporates two new important items, one is blockage aware MIV (monolithic inter-tier via) placement and the other is 3D net ordering for routing, intending to optimize wire length. Precisely, we propose a three-step approach: (1) retrieving the MIV region candidates for each 3D net, (2) fine-tuning placement to secure MIV spots in the presence of blockages, and (3) performing M3D routing with net ordering to consider the fine-tuned placement result. We implement the proposed M3D design flow by utilizing commercial 2D IC EDA tools while providing seamless optimization for cross-tier connections. In the meantime, our experiments confirm that proposed M3D design flow saves wire length per cross-tier net by up to 41.42%, which corresponds to 7.68% less total net switching power, equivalently 36.79% lower energy-delay-product over the conventional state-of-the-art M3D design flow.
{"title":"Tightly Linking 3D Via Allocation Towards Routing Optimization for Monolithic 3D ICs","authors":"Suwan Kim, Sehyeon Chung, Taewhan Kim, Heechun Park","doi":"10.1145/3531437.3539714","DOIUrl":"https://doi.org/10.1145/3531437.3539714","url":null,"abstract":"Monolithic 3D (M3D) is a revolutionary technology for high-density and high-performance chip design in the post-Moore era. However, it suffers from considerable thermal confinement due to the transistor stacking and insulating materials between the layers. As a way of reducing power, thereby mitigating the thermal problem, we propose a comprehensive physical design methodology that incorporates two new important items, one is blockage aware MIV (monolithic inter-tier via) placement and the other is 3D net ordering for routing, intending to optimize wire length. Precisely, we propose a three-step approach: (1) retrieving the MIV region candidates for each 3D net, (2) fine-tuning placement to secure MIV spots in the presence of blockages, and (3) performing M3D routing with net ordering to consider the fine-tuned placement result. We implement the proposed M3D design flow by utilizing commercial 2D IC EDA tools while providing seamless optimization for cross-tier connections. In the meantime, our experiments confirm that proposed M3D design flow saves wire length per cross-tier net by up to 41.42%, which corresponds to 7.68% less total net switching power, equivalently 36.79% lower energy-delay-product over the conventional state-of-the-art M3D design flow.","PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"307 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132937372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chao Lu, Shamik Kundu, Abraham Peedikayil Kuruvila, Supriya Margabandhu Ravichandran, K. Basu
The advent of quantum computing has engendered a widespread proliferation of efforts utilizing qubits for optimizing classical computational algorithms. Number Theoretic Transform (NTT) is one such popular algorithm that accelerates polynomial multiplication significantly and is consequently, the core arithmetic operation in most homomorphic encryption algorithms. Hence, fast and efficient execution of NTT is highly imperative for practical implementation of homomorphic encryption schemes in different computing paradigms. In this paper, we, for the first time, propose an efficient and scalable Quantum Number Theoretic Transform (QNTT) circuit using quantum gates. We introduce a novel exponential unit for modular exponential operation, which furnishes an algorithmic complexity of O(n). Our proposed methodology performs further optimization and logic synthesis of QNTT, that is significantly fast and facilitates efficient implementations on IBM’s quantum computers. The optimized QNTT achieves a gate-level complexity reduction from power of two to one with respect to bit length. Our methodology utilizes 44.2% fewer gates, thereby minimizing the circuit depth and a corresponding reduction in overhead and error probability, for a 4-point QNTT compared to its unoptimized counterpart.
{"title":"Design and Logic Synthesis of a Scalable, Efficient Quantum Number Theoretic Transform","authors":"Chao Lu, Shamik Kundu, Abraham Peedikayil Kuruvila, Supriya Margabandhu Ravichandran, K. Basu","doi":"10.1145/3531437.3543827","DOIUrl":"https://doi.org/10.1145/3531437.3543827","url":null,"abstract":"The advent of quantum computing has engendered a widespread proliferation of efforts utilizing qubits for optimizing classical computational algorithms. Number Theoretic Transform (NTT) is one such popular algorithm that accelerates polynomial multiplication significantly and is consequently, the core arithmetic operation in most homomorphic encryption algorithms. Hence, fast and efficient execution of NTT is highly imperative for practical implementation of homomorphic encryption schemes in different computing paradigms. In this paper, we, for the first time, propose an efficient and scalable Quantum Number Theoretic Transform (QNTT) circuit using quantum gates. We introduce a novel exponential unit for modular exponential operation, which furnishes an algorithmic complexity of O(n). Our proposed methodology performs further optimization and logic synthesis of QNTT, that is significantly fast and facilitates efficient implementations on IBM’s quantum computers. The optimized QNTT achieves a gate-level complexity reduction from power of two to one with respect to bit length. Our methodology utilizes 44.2% fewer gates, thereby minimizing the circuit depth and a corresponding reduction in overhead and error probability, for a 4-point QNTT compared to its unoptimized counterpart.","PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129548381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper addresses the combined problem of the three core tasks, namely routing on the middle-of-line (MOL) layer, generating I/O pin patterns (PP), and allocating contacts over active gates (COAG) in cell layout synthesis with 7nm and below technology. As yet, the existing cell layout generators have paid partial or little attention to those tasks, even with no awareness of the synergistic effects. This work overcomes this limitation by proposing a systematic and tightly-linked solution to the combined problem to boost the synergistic effects on chip implementation. Precisely, we solve the problem in three steps: (1) fully utilizing the horizontal routing resource on MOL layer by formulating the problem of in-cell routing into a weighted interval scheduling problem, (2) simultaneously performing the remaining horizontal in-cell routing and PP generation on metal 1 layer through the COAG exploitation while ensuring the pin accessibility constraint, and (3) completing in-cell routing by allocating vertical routing resource on MOL layer. Through experiments with benchmark designs, it is shown that our proposed layout method is able to generate standard cells with on average 34.2% shorter total length of metal 1 wire while retaining pin patterns that ensure pin accessibility, resulting in the chip implementations with up to 72.5% timing slack improvement and up to 15.6% power reduction that produced by using the conventional best available cells. In addition, by using less wire and vias, our in-cell router is able to consistently reduce the worst delay of cells, noticeably, reducing the sum of setup time and clock-to-Q delay of flip-flops by 1.2% ∼ 3.0% on average over that by the existing best cells.
{"title":"Improving Performance and Power by Co-Optimizing Middle-of-Line Routing, Pin Pattern Generation, and Contact over Active Gates in Standard Cell Layout Synthesis","authors":"Sehyeon Chung, Jooyeon Jeong, Taewhan Kim","doi":"10.1145/3531437.3539712","DOIUrl":"https://doi.org/10.1145/3531437.3539712","url":null,"abstract":"This paper addresses the combined problem of the three core tasks, namely routing on the middle-of-line (MOL) layer, generating I/O pin patterns (PP), and allocating contacts over active gates (COAG) in cell layout synthesis with 7nm and below technology. As yet, the existing cell layout generators have paid partial or little attention to those tasks, even with no awareness of the synergistic effects. This work overcomes this limitation by proposing a systematic and tightly-linked solution to the combined problem to boost the synergistic effects on chip implementation. Precisely, we solve the problem in three steps: (1) fully utilizing the horizontal routing resource on MOL layer by formulating the problem of in-cell routing into a weighted interval scheduling problem, (2) simultaneously performing the remaining horizontal in-cell routing and PP generation on metal 1 layer through the COAG exploitation while ensuring the pin accessibility constraint, and (3) completing in-cell routing by allocating vertical routing resource on MOL layer. Through experiments with benchmark designs, it is shown that our proposed layout method is able to generate standard cells with on average 34.2% shorter total length of metal 1 wire while retaining pin patterns that ensure pin accessibility, resulting in the chip implementations with up to 72.5% timing slack improvement and up to 15.6% power reduction that produced by using the conventional best available cells. In addition, by using less wire and vias, our in-cell router is able to consistently reduce the worst delay of cells, noticeably, reducing the sum of setup time and clock-to-Q delay of flip-flops by 1.2% ∼ 3.0% on average over that by the existing best cells.","PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115408826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To provide data and code confidentiality and reduce the risk of information leak from memory or memory bus, computing systems are enhanced with encryption and decryption engine. Despite massive efforts in designing hardware enhancements for data and code protection, existing solutions incur significant performance overhead as the encryption/decryption is on the critical path. In this paper, we present Sealer, a high-performance and low-overhead in-SRAM memory encryption engine by exploiting the massive parallelism and bitline computational capability of SRAM subarrays. Sealer encrypts data before sending it off-chip and decrypts it upon receiving the memory blocks, thus, providing data confidentiality. Our proposed solution requires only minimal modifications to the existing SRAM peripheral circuitry. Sealer can achieve up to two orders of magnitude throughput-per-area improvement while consuming 3 × less energy compared to prior solutions.
{"title":"Sealer: In-SRAM AES for High-Performance and Low-Overhead Memory Encryption","authors":"Jingyao Zhang, Hoda Naghibijouybari, Elaheh Sadredini","doi":"10.1145/3531437.3539699","DOIUrl":"https://doi.org/10.1145/3531437.3539699","url":null,"abstract":"To provide data and code confidentiality and reduce the risk of information leak from memory or memory bus, computing systems are enhanced with encryption and decryption engine. Despite massive efforts in designing hardware enhancements for data and code protection, existing solutions incur significant performance overhead as the encryption/decryption is on the critical path. In this paper, we present Sealer, a high-performance and low-overhead in-SRAM memory encryption engine by exploiting the massive parallelism and bitline computational capability of SRAM subarrays. Sealer encrypts data before sending it off-chip and decrypts it upon receiving the memory blocks, thus, providing data confidentiality. Our proposed solution requires only minimal modifications to the existing SRAM peripheral circuitry. Sealer can achieve up to two orders of magnitude throughput-per-area improvement while consuming 3 × less energy compared to prior solutions.","PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129943344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Heo, A. Fayyazi, Amirhossein Esmaili, M. Pedram
This paper introduces the sparse periodic systolic (SPS) dataflow, which advances the state-of-the-art hardware accelerator for supporting lightweight neural networks. Specifically, the SPS dataflow enables a novel hardware design approach unlocked by an emergent pruning scheme, periodic pattern-based sparsity (PPS). By exploiting the regularity of PPS, our sparsity-aware compiler optimally reorders the weights and uses a simple indexing unit in hardware to create matches between the weights and activations. Through the compiler-hardware codesign, SPS dataflow enjoys higher degrees of parallelism while being free of the high indexing overhead and without model accuracy loss. Evaluated on popular benchmarks such as VGG and ResNet, the SPS dataflow and accompanying neural network compiler outperform prior work in convolutional neural network (CNN) accelerator designs targeting FPGA devices. Against other sparsity-supporting weight storage formats, SPS results in 4.49 × energy efficiency gain while lowering storage requirements by 3.67 × for total weight storage (non-pruned weights plus indexing) and 22,044 × for indexing memory.
{"title":"Sparse Periodic Systolic Dataflow for Lowering Latency and Power Dissipation of Convolutional Neural Network Accelerators","authors":"J. Heo, A. Fayyazi, Amirhossein Esmaili, M. Pedram","doi":"10.1145/3531437.3539715","DOIUrl":"https://doi.org/10.1145/3531437.3539715","url":null,"abstract":"This paper introduces the sparse periodic systolic (SPS) dataflow, which advances the state-of-the-art hardware accelerator for supporting lightweight neural networks. Specifically, the SPS dataflow enables a novel hardware design approach unlocked by an emergent pruning scheme, periodic pattern-based sparsity (PPS). By exploiting the regularity of PPS, our sparsity-aware compiler optimally reorders the weights and uses a simple indexing unit in hardware to create matches between the weights and activations. Through the compiler-hardware codesign, SPS dataflow enjoys higher degrees of parallelism while being free of the high indexing overhead and without model accuracy loss. Evaluated on popular benchmarks such as VGG and ResNet, the SPS dataflow and accompanying neural network compiler outperform prior work in convolutional neural network (CNN) accelerator designs targeting FPGA devices. Against other sparsity-supporting weight storage formats, SPS results in 4.49 × energy efficiency gain while lowering storage requirements by 3.67 × for total weight storage (non-pruned weights plus indexing) and 22,044 × for indexing memory.","PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133640445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alberto Marchisio, Beatrice Bussolino, Edoardo Salvati, M. Martina, G. Masera, M. Shafique
Complex Deep Neural Networks such as Capsule Networks (CapsNets) exhibit high learning capabilities at the cost of compute-intensive operations. To enable their deployment on edge devices, we propose to leverage approximate computing for designing approximate variants of the complex operations like softmax and squash. In our experiments, we evaluate tradeoffs between area, power consumption, and critical path delay of the designs implemented with the ASIC design flow, and the accuracy of the quantized CapsNets, compared to the exact functions.
{"title":"Enabling Capsule Networks at the Edge through Approximate Softmax and Squash Operations","authors":"Alberto Marchisio, Beatrice Bussolino, Edoardo Salvati, M. Martina, G. Masera, M. Shafique","doi":"10.1145/3531437.3539717","DOIUrl":"https://doi.org/10.1145/3531437.3539717","url":null,"abstract":"Complex Deep Neural Networks such as Capsule Networks (CapsNets) exhibit high learning capabilities at the cost of compute-intensive operations. To enable their deployment on edge devices, we propose to leverage approximate computing for designing approximate variants of the complex operations like softmax and squash. In our experiments, we evaluate tradeoffs between area, power consumption, and critical path delay of the designs implemented with the ASIC design flow, and the accuracy of the quantized CapsNets, compared to the exact functions.","PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114524837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abhiroop Bhattacharjee, Youngeun Kim, Abhishek Moitra, P. Panda
Spiking Neural Networks (SNNs) have recently emerged as the low-power alternative to Artificial Neural Networks (ANNs) owing to their asynchronous, sparse, and binary information processing. To improve the energy-efficiency and throughput, SNNs can be implemented on memristive crossbars where Multiply-and-Accumulate (MAC) operations are realized in the analog domain using emerging Non-Volatile-Memory (NVM) devices. Despite the compatibility of SNNs with memristive crossbars, there is little attention to study on the effect of intrinsic crossbar non-idealities and stochasticity on the performance of SNNs. In this paper, we conduct a comprehensive analysis of the robustness of SNNs on non-ideal crossbars. We examine SNNs trained via learning algorithms such as, surrogate gradient and ANN-SNN conversion. Our results show that repetitive crossbar computations across multiple time-steps induce error accumulation, resulting in a huge performance drop during SNN inference. We further show that SNNs trained with a smaller number of time-steps achieve better accuracy when deployed on memristive crossbars.
{"title":"Examining the Robustness of Spiking Neural Networks on Non-ideal Memristive Crossbars","authors":"Abhiroop Bhattacharjee, Youngeun Kim, Abhishek Moitra, P. Panda","doi":"10.1145/3531437.3539729","DOIUrl":"https://doi.org/10.1145/3531437.3539729","url":null,"abstract":"Spiking Neural Networks (SNNs) have recently emerged as the low-power alternative to Artificial Neural Networks (ANNs) owing to their asynchronous, sparse, and binary information processing. To improve the energy-efficiency and throughput, SNNs can be implemented on memristive crossbars where Multiply-and-Accumulate (MAC) operations are realized in the analog domain using emerging Non-Volatile-Memory (NVM) devices. Despite the compatibility of SNNs with memristive crossbars, there is little attention to study on the effect of intrinsic crossbar non-idealities and stochasticity on the performance of SNNs. In this paper, we conduct a comprehensive analysis of the robustness of SNNs on non-ideal crossbars. We examine SNNs trained via learning algorithms such as, surrogate gradient and ANN-SNN conversion. Our results show that repetitive crossbar computations across multiple time-steps induce error accumulation, resulting in a huge performance drop during SNN inference. We further show that SNNs trained with a smaller number of time-steps achieve better accuracy when deployed on memristive crossbars.","PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"111 3S 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133423719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Despite potential quantum supremacy, state-of-the-art quantum neural networks (QNNs) suffer from low inference accuracy. First, the current Noisy Intermediate-Scale Quantum (NISQ) devices with high error rates of 10− 3 to 10− 2 significantly degrade the accuracy of a QNN. Second, although recently proposed Re-Uploading Units (RUUs) introduce some non-linearity into the QNN circuits, the theory behind it is not fully understood. Furthermore, previous RUUs that repeatedly upload original data can only provide marginal accuracy improvements. Third, current QNN circuit ansatz uses fixed two-qubit gates to enforce maximum entanglement capability, making task-specific entanglement tuning impossible, resulting in poor overall performance. In this paper, we propose a Quantum Multilayer Perceptron (QMLP) architecture featured by error-tolerant input embedding, rich nonlinearity, and enhanced variational circuit ansatz with parameterized two-qubit entangling gates. Compared to prior arts, QMLP increases the inference accuracy on the 10-class MNIST dataset by 10% with 2 × fewer quantum gates and 3 × reduced parameters. Our source code is available and can be found in https://github.com/chuchengc/QMLP/.
{"title":"QMLP: An Error-Tolerant Nonlinear Quantum MLP Architecture using Parameterized Two-Qubit Gates","authors":"Cheng Chu, Nai-Hui Chia, Lei Jiang, Fan Chen","doi":"10.1145/3531437.3539719","DOIUrl":"https://doi.org/10.1145/3531437.3539719","url":null,"abstract":"Despite potential quantum supremacy, state-of-the-art quantum neural networks (QNNs) suffer from low inference accuracy. First, the current Noisy Intermediate-Scale Quantum (NISQ) devices with high error rates of 10− 3 to 10− 2 significantly degrade the accuracy of a QNN. Second, although recently proposed Re-Uploading Units (RUUs) introduce some non-linearity into the QNN circuits, the theory behind it is not fully understood. Furthermore, previous RUUs that repeatedly upload original data can only provide marginal accuracy improvements. Third, current QNN circuit ansatz uses fixed two-qubit gates to enforce maximum entanglement capability, making task-specific entanglement tuning impossible, resulting in poor overall performance. In this paper, we propose a Quantum Multilayer Perceptron (QMLP) architecture featured by error-tolerant input embedding, rich nonlinearity, and enhanced variational circuit ansatz with parameterized two-qubit entangling gates. Compared to prior arts, QMLP increases the inference accuracy on the 10-class MNIST dataset by 10% with 2 × fewer quantum gates and 3 × reduced parameters. Our source code is available and can be found in https://github.com/chuchengc/QMLP/.","PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130661697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}