Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design: Latest Publications

SACS: A Self-Adaptive Checkpointing Strategy for Microkernel-Based Intermittent Systems
Yen-Ting Chen, Han-Xiang Liu, Yuan-Hao Chang, Yu-Pei Liang, W. Shih
Intermittent systems are usually energy-harvesting embedded systems that harvest energy from the ambient environment and perform computation intermittently. Because the power supply is unreliable, these intermittent systems typically adopt checkpointing strategies to ensure data consistency and preserve execution progress after the system resumes from an unpredictable power failure. Existing checkpointing strategies are usually suited to bare-metal intermittent systems with short run times. As energy-harvesting techniques improve, intermittent systems gain longer run times and more computing power, so more and more of them run a microkernel to handle multiple tasks at the same time. However, existing checkpointing strategies were not designed for (or aware of) such microkernel-based intermittent systems that run multiple tasks, and thus perform poorly at preserving execution progress. To tackle this issue, we propose a design called the self-adaptive checkpointing strategy (SACS), tailored for microkernel-based intermittent systems. By leveraging the time-slicing scheduler, the proposed design dynamically adjusts the checkpointing interval at both run time and reboot time, improving system performance by striking a good balance between execution progress and the number of performed checkpoints. A series of experiments was conducted on a Texas Instruments (TI) development board with well-known benchmarks. Experimental results show that, compared with state-of-the-art designs, our design reduces execution time by at least 46.8% under different ambient-environment conditions while keeping the number of performed checkpoints at an acceptable scale.
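The abstract does not spell out the adaptation rule, so the Python sketch below only illustrates the general idea of adjusting a checkpoint interval at run time and at reboot time; the doubling/halving policy, thresholds, and names are hypothetical and are not taken from the paper.

```python
class AdaptiveCheckpointer:
    """Illustrative run-time/reboot-time interval controller (not the actual SACS algorithm)."""

    def __init__(self, interval_ms=10, min_ms=1, max_ms=100):
        self.interval_ms = interval_ms                      # current checkpoint interval
        self.min_ms, self.max_ms = min_ms, max_ms

    def on_checkpoint_committed(self):
        # Run time: the harvested energy sufficed to commit a checkpoint,
        # so try a longer interval to spend less time checkpointing.
        self.interval_ms = min(self.max_ms, self.interval_ms * 2)

    def on_reboot(self, lost_progress_ms):
        # Reboot time: the more progress was lost to the power failure,
        # the more aggressively the interval is shortened.
        if lost_progress_ms > self.interval_ms // 2:
            self.interval_ms = max(self.min_ms, self.interval_ms // 2)
```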
DOI: 10.1145/3531437.3539705 | Published: 2022-08-01
Citations: 0
Exploiting successive identical words and differences with dynamic bases for effective compression in Non-Volatile Memories
Swati Upadhyay, Arijit Nath, H. Kapoor
Emerging non-volatile memories (NVMs) are considered potential candidates for replacing traditional DRAM in main memory. However, downsides such as long write latency, high write energy, and low write endurance make their direct adoption in the memory hierarchy challenging. Approaches that reduce the number of bits written help overcome these drawbacks. In this direction, we propose a compression technique that reduces the overall bits written to the NVM, thus improving its lifetime. The proposed method, SIBR, compresses incoming blocks destined for PCM by either eliminating the words to be written or reducing the number of bits written for each word. For the former, words that have zero content or are identical to their consecutive words are not written. The latter is done by computing the difference of each word from a base word and storing only the difference (or delta) instead of the full word. The novelty of our contribution is updating the base word at run time, thus achieving better compression. Computing the delta against a dynamically decided base, rather than a fixed base, is shown to give smaller delta values; the dynamic base is another word in the same block. SIBR outperforms two state-of-the-art compression techniques by achieving a fairly low compression ratio and high coverage. Experimental results show a substantial reduction in bit flips and an improvement in lifetime.
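A minimal Python sketch of the encoding described above follows; treating the previous surviving word as the dynamic base and the tag names ZERO/SAME/BASE/DELTA are our own reading for illustration, not necessarily the paper's exact format.

```python
def sibr_compress(block):
    """Sketch of SIBR-style word elimination plus dynamic-base delta encoding.

    `block` is a list of fixed-width integer words. Zero words and words identical
    to their preceding word are elided; every other word is stored as a delta
    against a dynamically updated base word from the same block.
    """
    encoded, base = [], None
    for i, word in enumerate(block):
        if word == 0:
            encoded.append(("ZERO",))
        elif i > 0 and word == block[i - 1]:
            encoded.append(("SAME",))               # identical to the consecutive word
        elif base is None:
            encoded.append(("BASE", word))          # first explicit word becomes the base
            base = word
        else:
            encoded.append(("DELTA", word - base))  # small delta -> fewer bits written
            base = word                             # base is updated at run time
    return encoded

print(sibr_compress([0, 7, 7, 9, 12, 0]))
# -> [('ZERO',), ('BASE', 7), ('SAME',), ('DELTA', 2), ('DELTA', 3), ('ZERO',)]
```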
DOI: 10.1145/3531437.3539716 | Published: 2022-08-01
Citations: 1
Tightly Linking 3D Via Allocation Towards Routing Optimization for Monolithic 3D ICs
Suwan Kim, Sehyeon Chung, Taewhan Kim, Heechun Park
Monolithic 3D (M3D) is a revolutionary technology for high-density, high-performance chip design in the post-Moore era. However, it suffers from considerable thermal confinement due to transistor stacking and the insulating materials between tiers. As a way of reducing power, and thereby mitigating the thermal problem, we propose a comprehensive physical design methodology that incorporates two new key components, blockage-aware MIV (monolithic inter-tier via) placement and 3D net ordering for routing, with the aim of optimizing wire length. Specifically, we propose a three-step approach: (1) retrieving the MIV region candidates for each 3D net, (2) fine-tuning the placement to secure MIV spots in the presence of blockages, and (3) performing M3D routing with net ordering that accounts for the fine-tuned placement result. We implement the proposed M3D design flow with commercial 2D IC EDA tools while providing seamless optimization for cross-tier connections. Our experiments confirm that the proposed flow saves wire length per cross-tier net by up to 41.42%, which corresponds to 7.68% less total net switching power and, equivalently, a 36.79% lower energy-delay product than the conventional state-of-the-art M3D design flow.
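As a rough illustration of step (1), one simple way to derive an MIV candidate window for a cross-tier net is to intersect the bounding boxes of its pins on the two tiers; the Python sketch below uses that hypothetical rule and ignores blockages, so it is only a stand-in for the paper's actual candidate extraction.

```python
def bbox(pins):
    """Axis-aligned bounding box of a list of (x, y) pin locations."""
    xs, ys = zip(*pins)
    return min(xs), min(ys), max(xs), max(ys)

def miv_candidate_region(top_pins, bottom_pins):
    """Hypothetical MIV candidate window for one 3D net: the overlap of the pin
    bounding boxes on the two tiers; if they do not overlap, fall back to the
    merged box (any MIV placed there detours some wire on one tier)."""
    tx0, ty0, tx1, ty1 = bbox(top_pins)
    bx0, by0, bx1, by1 = bbox(bottom_pins)
    x0, y0 = max(tx0, bx0), max(ty0, by0)
    x1, y1 = min(tx1, bx1), min(ty1, by1)
    if x0 > x1 or y0 > y1:                      # boxes do not overlap
        return bbox(top_pins + bottom_pins)
    return x0, y0, x1, y1

print(miv_candidate_region([(0, 0), (4, 3)], [(2, 1), (6, 5)]))  # -> (2, 1, 4, 3)
```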
DOI: 10.1145/3531437.3539714 | Published: 2022-08-01
Citations: 0
Design and Logic Synthesis of a Scalable, Efficient Quantum Number Theoretic Transform
Chao Lu, Shamik Kundu, Abraham Peedikayil Kuruvila, Supriya Margabandhu Ravichandran, K. Basu
The advent of quantum computing has spurred widespread efforts to use qubits to optimize classical computational algorithms. The Number Theoretic Transform (NTT) is one such popular algorithm: it accelerates polynomial multiplication significantly and is consequently the core arithmetic operation in most homomorphic encryption algorithms. Hence, fast and efficient execution of the NTT is imperative for practical implementations of homomorphic encryption schemes across computing paradigms. In this paper, we propose, for the first time, an efficient and scalable Quantum Number Theoretic Transform (QNTT) circuit built from quantum gates. We introduce a novel exponential unit for the modular exponential operation, which furnishes an algorithmic complexity of O(n). Our methodology performs further optimization and logic synthesis of the QNTT, which is significantly faster and facilitates efficient implementations on IBM's quantum computers. The optimized QNTT achieves a gate-level complexity reduction from a power of two to one with respect to bit length. For a 4-point QNTT, our methodology uses 44.2% fewer gates than its unoptimized counterpart, thereby minimizing circuit depth and correspondingly reducing overhead and error probability.
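For reference, the transform that a QNTT circuit evaluates is the classical number theoretic transform shown below; this O(n^2) Python sketch only illustrates the modular arithmetic, and the 4-point example with modulus 17 is our own choice rather than a configuration from the paper.

```python
def ntt(a, omega, q):
    """Classical reference NTT: A[k] = sum_j a[j] * omega^(j*k) mod q."""
    n = len(a)
    return [sum(a[j] * pow(omega, j * k, q) for j in range(n)) % q for k in range(n)]

# 4-point example over Z_17: omega = 4 is a primitive 4th root of unity (4^4 ≡ 1 mod 17).
print(ntt([1, 2, 3, 4], omega=4, q=17))  # -> [10, 7, 15, 6]
```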
DOI: 10.1145/3531437.3543827 | Published: 2022-08-01
Citations: 2
Improving Performance and Power by Co-Optimizing Middle-of-Line Routing, Pin Pattern Generation, and Contact over Active Gates in Standard Cell Layout Synthesis
Sehyeon Chung, Jooyeon Jeong, Taewhan Kim
This paper addresses the combined problem of three core tasks in cell layout synthesis at 7nm and below: routing on the middle-of-line (MOL) layer, generating I/O pin patterns (PP), and allocating contacts over active gates (COAG). To date, existing cell layout generators have paid little or only partial attention to these tasks, with no awareness of their synergistic effects. This work overcomes that limitation by proposing a systematic, tightly linked solution to the combined problem that boosts the synergistic effects on chip implementation. Specifically, we solve the problem in three steps: (1) fully utilizing the horizontal routing resource on the MOL layer by formulating the in-cell routing problem as a weighted interval scheduling problem, (2) simultaneously performing the remaining horizontal in-cell routing and PP generation on the metal 1 layer through COAG exploitation while ensuring the pin-accessibility constraint, and (3) completing in-cell routing by allocating vertical routing resources on the MOL layer. Experiments with benchmark designs show that our layout method generates standard cells with, on average, 34.2% shorter total metal 1 wire length while retaining pin patterns that ensure pin accessibility, resulting in chip implementations with up to 72.5% timing-slack improvement and up to 15.6% power reduction compared with those produced using the conventional best available cells. In addition, by using less wire and fewer vias, our in-cell router consistently reduces the worst delay of cells, notably lowering the sum of the setup time and clock-to-Q delay of flip-flops by 1.2% to 3.0% on average relative to the existing best cells.
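Step (1) above maps in-cell routing on the MOL layer to weighted interval scheduling, a classic problem with an O(n log n) dynamic-programming solution. The Python sketch below shows that textbook algorithm; how wire segments and their weights map onto intervals is left as an assumption and is not taken from the paper.

```python
from bisect import bisect_right

def weighted_interval_scheduling(intervals):
    """Maximum total weight of pairwise non-overlapping (start, end, weight) intervals."""
    intervals = sorted(intervals, key=lambda iv: iv[1])    # sort by end point
    ends = [iv[1] for iv in intervals]
    dp = [0] * (len(intervals) + 1)
    for i, (start, end, weight) in enumerate(intervals, 1):
        p = bisect_right(ends, start, 0, i - 1)            # intervals ending at or before `start`
        dp[i] = max(dp[i - 1], dp[p] + weight)             # skip interval i, or take it
    return dp[-1]

print(weighted_interval_scheduling([(0, 3, 5), (2, 5, 6), (4, 7, 5)]))  # -> 10
```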
DOI: 10.1145/3531437.3539712 | Published: 2022-08-01
Citations: 0
Sealer: In-SRAM AES for High-Performance and Low-Overhead Memory Encryption
Jingyao Zhang, Hoda Naghibijouybari, Elaheh Sadredini
To provide data and code confidentiality and reduce the risk of information leakage from memory or the memory bus, computing systems are enhanced with encryption and decryption engines. Despite massive efforts in designing hardware enhancements for data and code protection, existing solutions incur significant performance overhead because encryption/decryption sits on the critical path. In this paper, we present Sealer, a high-performance, low-overhead in-SRAM memory encryption engine that exploits the massive parallelism and bitline computational capability of SRAM subarrays. Sealer encrypts data before sending it off-chip and decrypts it upon receiving memory blocks, thus providing data confidentiality. Our proposed solution requires only minimal modifications to the existing SRAM peripheral circuitry. Sealer achieves up to two orders of magnitude improvement in throughput per area while consuming 3× less energy compared to prior solutions.
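Conceptually, the engine encrypts each cache line before it leaves the chip and decrypts it on the way back. The Python sketch below shows that idea in software with pycryptodome's standard AES, using ECB over a 64-byte line purely for brevity; it is not the in-SRAM bitline implementation, and a real memory-encryption engine would use a tweaked or counter mode rather than raw ECB.

```python
import os
from Crypto.Cipher import AES        # pycryptodome; stands in for the in-SRAM AES datapath

KEY = os.urandom(16)                 # 128-bit key that would stay on-chip in the real design

def write_line_offchip(cache_line: bytes) -> bytes:
    """Encrypt a 64-byte cache line (four 16-byte AES blocks) before it leaves the chip."""
    assert len(cache_line) == 64
    return AES.new(KEY, AES.MODE_ECB).encrypt(cache_line)

def read_line_onchip(ciphertext: bytes) -> bytes:
    """Decrypt a memory block as it is received back on-chip."""
    return AES.new(KEY, AES.MODE_ECB).decrypt(ciphertext)

line = bytes(range(64))
assert read_line_onchip(write_line_offchip(line)) == line
```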
DOI: 10.1145/3531437.3539699 | Published: 2022-07-04
Citations: 6
Sparse Periodic Systolic Dataflow for Lowering Latency and Power Dissipation of Convolutional Neural Network Accelerators
J. Heo, A. Fayyazi, Amirhossein Esmaili, M. Pedram
This paper introduces the sparse periodic systolic (SPS) dataflow, which advances state-of-the-art hardware accelerators for lightweight neural networks. Specifically, the SPS dataflow enables a novel hardware design approach unlocked by an emergent pruning scheme, periodic pattern-based sparsity (PPS). By exploiting the regularity of PPS, our sparsity-aware compiler optimally reorders the weights and uses a simple indexing unit in hardware to create matches between weights and activations. Through this compiler-hardware codesign, the SPS dataflow enjoys a higher degree of parallelism while avoiding high indexing overhead and incurring no model accuracy loss. Evaluated on popular benchmarks such as VGG and ResNet, the SPS dataflow and the accompanying neural network compiler outperform prior convolutional neural network (CNN) accelerator designs targeting FPGA devices. Against other sparsity-supporting weight storage formats, SPS yields a 4.49× energy-efficiency gain while lowering storage requirements by 3.67× for total weight storage (non-pruned weights plus indexing) and by 22,044× for indexing memory.
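To illustrate how a periodic sparsity pattern removes per-element indexing, here is a hedged NumPy sketch of packing weights whose nonzeros repeat at fixed offsets within each period, together with the matching lookup during a dot product; the layout and function names are hypothetical, not the actual SPS storage format or compiler output.

```python
import numpy as np

def sps_pack(weights, period, pattern):
    """Keep only the weights at the given offsets inside each period; all other
    positions are assumed pruned to zero, so a single shared pattern replaces
    the per-element indices of generic sparse formats."""
    kept = [weights[p * period + off]
            for p in range(len(weights) // period)
            for off in pattern]
    return np.array(kept)

def sps_dot(packed, pattern, period, activations):
    """Indexing 'unit': pair each packed weight with its activation and accumulate."""
    acc, k = 0.0, 0
    for p in range(len(activations) // period):
        for off in pattern:
            acc += packed[k] * activations[p * period + off]
            k += 1
    return acc

w = np.array([0.5, 0, 0, 0.1, 0.2, 0, 0, 0.3])   # nonzeros at offsets {0, 3} of every period of 4
packed = sps_pack(w, period=4, pattern=[0, 3])
x = np.arange(8, dtype=float)
assert np.isclose(sps_dot(packed, [0, 3], 4, x), w @ x)
```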
DOI: 10.1145/3531437.3539715 | Published: 2022-06-30
Citations: 2
Enabling Capsule Networks at the Edge through Approximate Softmax and Squash Operations
Alberto Marchisio, Beatrice Bussolino, Edoardo Salvati, M. Martina, G. Masera, M. Shafique
Complex deep neural networks such as Capsule Networks (CapsNets) exhibit high learning capability at the cost of compute-intensive operations. To enable their deployment on edge devices, we propose to leverage approximate computing to design approximate variants of complex operations like softmax and squash. In our experiments, we evaluate the trade-offs between the area, power consumption, and critical-path delay of designs implemented with an ASIC design flow, and the accuracy of the quantized CapsNets, compared to the exact functions.
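For reference, the exact operations being approximated are shown below in NumPy; the paper's approximate hardware variants themselves are not reproduced here, so this sketch only fixes what softmax and squash compute.

```python
import numpy as np

def softmax(x):
    """Reference softmax; the exponentiation and division are the costly parts targeted for approximation."""
    e = np.exp(x - np.max(x))            # subtract the max for numerical stability
    return e / e.sum()

def squash(s, eps=1e-9):
    """CapsNet squash: rescale capsule vector s to length ||s||^2 / (1 + ||s||^2)."""
    n2 = np.dot(s, s)                    # squared norm
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

v = squash(np.array([3.0, 4.0]))         # ||s|| = 5, so the output length is 25/26 ≈ 0.96
print(softmax(np.array([1.0, 2.0, 3.0])), np.linalg.norm(v))
```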
DOI: 10.1145/3531437.3539717 | Published: 2022-06-21
Citations: 6
Examining the Robustness of Spiking Neural Networks on Non-ideal Memristive Crossbars
Abhiroop Bhattacharjee, Youngeun Kim, Abhishek Moitra, P. Panda
Spiking Neural Networks (SNNs) have recently emerged as a low-power alternative to Artificial Neural Networks (ANNs) owing to their asynchronous, sparse, and binary information processing. To improve energy efficiency and throughput, SNNs can be implemented on memristive crossbars, where Multiply-and-Accumulate (MAC) operations are realized in the analog domain using emerging Non-Volatile-Memory (NVM) devices. Despite the compatibility of SNNs with memristive crossbars, little attention has been paid to the effect of intrinsic crossbar non-idealities and stochasticity on SNN performance. In this paper, we conduct a comprehensive analysis of the robustness of SNNs on non-ideal crossbars. We examine SNNs trained via learning algorithms such as surrogate gradient and ANN-SNN conversion. Our results show that repetitive crossbar computations across multiple time steps induce error accumulation, resulting in a huge performance drop during SNN inference. We further show that SNNs trained with a smaller number of time steps achieve better accuracy when deployed on memristive crossbars.
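A toy NumPy model of how crossbar non-idealities compound over SNN time steps might look like the sketch below; the multiplicative Gaussian conductance-variation model and every constant in it are illustrative assumptions, not the error model used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def crossbar_mac(weights, spikes, sigma=0.1):
    """One analog MAC on a memristive crossbar with a crude non-ideality model:
    each weight is realized as a conductance with multiplicative Gaussian variation,
    and binary input spikes on the word lines produce bit-line currents (the dot product)."""
    g = weights * (1.0 + sigma * rng.standard_normal(weights.shape))
    return g @ spikes                              # ideal result would be weights @ spikes

weights = rng.standard_normal((4, 16))             # 4 output neurons, 16 inputs
membrane = np.zeros(4)
for t in range(8):                                 # 8 SNN time steps
    spikes = (rng.random(16) < 0.2).astype(float)  # sparse binary input spikes
    membrane += crossbar_mac(weights, spikes)      # the crossbar error re-enters every step
```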
DOI: 10.1145/3531437.3539729 | Published: 2022-06-20
Citations: 11
QMLP: An Error-Tolerant Nonlinear Quantum MLP Architecture using Parameterized Two-Qubit Gates
Cheng Chu, Nai-Hui Chia, Lei Jiang, Fan Chen
Despite the potential for quantum supremacy, state-of-the-art quantum neural networks (QNNs) suffer from low inference accuracy. First, current Noisy Intermediate-Scale Quantum (NISQ) devices, with error rates on the order of 10^-3 to 10^-2, significantly degrade the accuracy of a QNN. Second, although recently proposed Re-Uploading Units (RUUs) introduce some non-linearity into QNN circuits, the theory behind them is not fully understood. Furthermore, previous RUUs that repeatedly upload original data provide only marginal accuracy improvements. Third, current QNN circuit ansatze use fixed two-qubit gates to enforce maximum entanglement capability, making task-specific entanglement tuning impossible and resulting in poor overall performance. In this paper, we propose a Quantum Multilayer Perceptron (QMLP) architecture featuring error-tolerant input embedding, rich nonlinearity, and an enhanced variational circuit ansatz with parameterized two-qubit entangling gates. Compared to prior art, QMLP increases inference accuracy on the 10-class MNIST dataset by 10% with 2× fewer quantum gates and 3× fewer parameters. Our source code is available at https://github.com/chuchengc/QMLP/.
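A minimal NumPy sketch of the two ingredients named above, data (re-)uploading through single-qubit rotations and a parameterized two-qubit entangler, is given below; the RZZ gate and the tiny two-qubit layer are generic illustrations and are not the paper's actual ansatz.

```python
import numpy as np

def ry(theta):
    """Single-qubit Y rotation used here for (re-)uploading a feature as an angle."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def rzz(theta):
    """Parameterized two-qubit entangler exp(-i*theta/2 * Z(x)Z); since Z(x)Z is diagonal,
    the gate is a diagonal unitary."""
    zz = np.kron([1.0, -1.0], [1.0, -1.0])          # eigenvalues of Z tensor Z
    return np.diag(np.exp(-1j * theta / 2 * zz))

x, theta = np.array([0.3, 1.2]), 0.7                # two input features, one trainable parameter
state = np.zeros(4, dtype=complex)
state[0] = 1.0                                      # start in |00>
state = np.kron(ry(x[0]), ry(x[1])) @ state         # first data upload
state = rzz(theta) @ state                          # parameterized two-qubit entangling gate
state = np.kron(ry(x[0]), ry(x[1])) @ state         # re-upload the same data (RUU-style)
print(np.abs(state) ** 2)                           # computational-basis measurement probabilities
```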
DOI: 10.1145/3531437.3539719 | Published: 2022-06-03
Citations: 5