Xiangren Chen, Bohan Yang, Yong Lu, S. Yin, Shaojun Wei, Leibo Liu
Number theoretic transform (NTT) hardware accelerators have become a crucial building block in many cryptosystems, such as post-quantum cryptography. In this paper, we provide new insights into the construction of a conflict-free memory mapping scheme (CFMMS) for multi-bank NTT architectures. First, we present a parallel loop structure for arbitrary-radix NTT and propose two point-fetching modes. We then transform the conflict-free mapping problem into a conflict graph and develop a novel heuristic to explore the design space of CFMMS, which yields a more efficient access scheme than classic works. To further verify the methodology, we design high-performance NTT/INTT kernels for Dilithium, whose area-time efficiency significantly outperforms state-of-the-art works on a similar FPGA platform.
{"title":"Efficient access scheme for multi-bank based NTT architecture through conflict graph","authors":"Xiangren Chen, Bohan Yang, Yong Lu, S. Yin, Shaojun Wei, Leibo Liu","doi":"10.1145/3489517.3530656","DOIUrl":"https://doi.org/10.1145/3489517.3530656","journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference"},"publicationDate":"2022-07-10"}
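The conflict-graph view in the NTT abstract above can be illustrated with a minimal sketch. It assumes an in-place radix-2 NTT and two memory banks, and it uses the classic index-parity mapping as the conflict-free example; it is not the heuristic proposed in the paper. The function names (`butterfly_pairs`, `conflict_edges`, `count_conflicts`, `low_bit_bank`, `parity_bank`) are hypothetical and chosen for illustration only.

```python
def butterfly_pairs(n):
    """Yield the index pairs fetched together by each radix-2 butterfly.

    Assumes n is a power of two and an in-place transform whose stages
    use strides n/2, n/4, ..., 1.
    """
    stride = n // 2
    while stride >= 1:
        for i in range(n):
            if i & stride == 0:
                yield (i, i | stride)
        stride //= 2


def conflict_edges(n):
    """Edges of the conflict graph: two points that are read in the same cycle."""
    return set(butterfly_pairs(n))


def count_conflicts(n, bank_of):
    """Count edges whose endpoints land in the same bank under mapping bank_of."""
    return sum(1 for a, b in conflict_edges(n) if bank_of(a) == bank_of(b))


def low_bit_bank(i):
    """Naive mapping: the bank is the lowest address bit."""
    return i % 2


def parity_bank(i):
    """Classic mapping: the bank is the XOR of all address bits."""
    return bin(i).count("1") % 2


if __name__ == "__main__":
    print(count_conflicts(8, low_bit_bank))  # 8 of the 12 edges collide
    print(count_conflicts(8, parity_bank))   # 0, a proper 2-coloring
```

A mapping is conflict-free exactly when it is a proper coloring of this graph with one color per bank. The parity mapping works here because the two operands of a radix-2 butterfly differ in exactly one address bit.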
Complex event-driven neuron dynamics have been an obstacle to implementing efficient brain-inspired computing architectures with VLSI circuits. To solve this problem and harness the event-driven advantage, we propose ASTERS, a resistive random-access memory (ReRAM) based neuromorphic design for time-to-first-spike SNN inference. In addition to novel axon and neuron circuits, we propose two techniques through hardware-software co-design: "Multi-Level Firing Threshold Adjustment" to mitigate the impact of ReRAM device process variations, and "Timing Threshold Adjustment" to further speed up the computation. Experimental results show that our cross-layer solution ASTERS achieves more than 34.7% energy savings compared to existing spiking neuromorphic designs, while maintaining 90.1% accuracy under process variations with a 20% standard deviation.
{"title":"ASTERS: adaptable threshold spike-timing neuromorphic design with twin-column ReRAM synapses","authors":"Ziru Li, Qilin Zheng, Bonan Yan, Ru Huang, Bing Li, Yiran Chen","doi":"10.1145/3489517.3530591","DOIUrl":"https://doi.org/10.1145/3489517.3530591","journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference"},"publicationDate":"2022-07-10"}
Hybrid flash-based storage built from high-density, low-cost flash memory has become increasingly popular in consumer devices over the last decade. However, existing methods for protecting critical data are designed to improve the reliability of consumer devices with non-hybrid flash storage. Our evaluation and analysis show that these methods cause significant performance and lifetime degradation in consumer devices with hybrid storage. The reason is that the different kinds of memory in hybrid storage have different characteristics, such as performance and access granularity. To address these problems, we propose a critical data backup (CDB) method that backs up designated critical data while making full use of the different kinds of memory in hybrid storage. Experimental results show that, compared with the state of the art, CDB achieves encouraging performance and lifetime improvements.
{"title":"CDB: critical data backup design for consumer devices with high-density flash based hybrid storage","authors":"Longfei Luo, Dingcui Yu, Liang Shi, Chuanming Ding, Changlong Li, E. Sha","doi":"10.1145/3489517.3530468","DOIUrl":"https://doi.org/10.1145/3489517.3530468","journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference"},"publicationDate":"2022-07-10"}
Zhiheng Yue, Yabing Wang, Leibo Liu, Shaojun Wei, S. Yin
This paper proposes a computation-in-memory design for stereo matching cost computation. Matching cost computation incurs large energy and latency overhead because of frequent memory access. To overcome the limitations of previous designs, this work, named MC-CIM, performs matching cost computation without incurring memory access and introduces several key features. (1) A lightweight balanced computing unit is integrated within the cell array to reduce memory access and improve system throughput. (2) A self-optimized circuit design allows the arithmetic operation to be altered for matching algorithms in various scenarios. (3) A flexible data mapping method and reconfigurable digital peripherals exploit maximum parallelism across different algorithms and bit precisions. The proposed design is implemented in 28nm technology and achieves an average performance of 277 TOPS/W.
{"title":"MC-CIM: a reconfigurable computation-in-memory for efficient stereo matching cost computation","authors":"Zhiheng Yue, Yabing Wang, Leibo Liu, Shaojun Wei, S. Yin","doi":"10.1145/3489517.3530477","DOIUrl":"https://doi.org/10.1145/3489517.3530477","journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference"},"publicationDate":"2022-07-10"}
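To make the memory-access pattern behind matching cost computation concrete, here is a minimal software sketch of one common cost, the sum of absolute differences (SAD) over a 1-D window. The MC-CIM abstract does not say which matching costs the design supports, so SAD is an assumption made for illustration, and the names `sad_cost` and `best_disparity` are hypothetical.

```python
def sad_cost(left_row, right_row, x, d, half):
    """SAD between the window centred at x in the left row and at x - d in the right row."""
    return sum(
        abs(left_row[x + k] - right_row[x - d + k])
        for k in range(-half, half + 1)
    )


def best_disparity(left_row, right_row, x, max_d, half=1):
    """Return the disparity in [0, max_d] with the lowest SAD cost.

    Disparities whose window would fall off the left edge of the right
    row are skipped. Because those are always the largest ones, the list
    position of each cost equals its disparity.
    """
    costs = [
        sad_cost(left_row, right_row, x, d, half)
        for d in range(max_d + 1)
        if x - d - half >= 0
    ]
    return min(range(len(costs)), key=costs.__getitem__)


if __name__ == "__main__":
    right = [10, 20, 90, 40, 50, 60, 70, 80]
    left = [0, 0, 10, 20, 90, 40, 50, 60]   # right row shifted by two pixels
    print(best_disparity(left, right, x=4, max_d=3))  # 2
```

Every candidate disparity re-reads the same left window and a shifted right window, which is the repeated memory traffic that an in-memory design aims to avoid.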
Design space exploration (DSE) can automatically and effectively determine design parameters to achieve optimal performance, power, and area (PPA) in very-large-scale integration (VLSI) design. However, a lack of prior knowledge makes the exploration inefficient. In this paper, we propose a fast parameter tuning framework based on transfer learning and multi-objective Bayesian optimization to quickly find the optimal design parameters. A Gaussian copula is used to establish the correlation of the implemented technology. The prior knowledge is integrated into multi-objective Bayesian optimization by transforming the PPA data into residual observations. An uncertainty-aware search acquisition function is employed to explore the design space efficiently. Experiments on a CPU design show that this framework achieves a higher-quality Pareto frontier with fewer design flow runs than state-of-the-art methodologies.
{"title":"A fast parameter tuning framework via transfer learning and multi-objective bayesian optimization","authors":"Zheng Zhang, Tinghuan Chen, Jiaxin Huang, Meng Zhang","doi":"10.1145/3489517.3530430","DOIUrl":"https://doi.org/10.1145/3489517.3530430","journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference"},"publicationDate":"2022-07-10"}
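The Pareto frontier that the parameter tuning abstract above uses as its quality measure can be sketched as a simple non-dominated filter. This is only the definition of the frontier, not the paper's Bayesian optimization loop. It assumes every objective is to be minimized (for example delay, power, and area), and the names `dominates` and `pareto_front` are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Hypothetical (delay, power, area) results of four design flow runs.
    runs = [(1.0, 5.0, 3.0), (2.0, 4.0, 3.0), (2.0, 6.0, 4.0), (1.0, 5.0, 2.0)]
    print(pareto_front(runs))  # [(2.0, 4.0, 3.0), (1.0, 5.0, 2.0)]
```

The filter is quadratic in the number of runs, which is usually acceptable because each run of a design flow is far more expensive than the comparison.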
Weight clustering is an effective technique for reducing the memory footprint of deep neural networks (DNNs): it stores a limited number of unique weights together with low-bit weight indexes that record the clustering information. In this paper, we propose PatterNet, which enforces shared clustering topologies on filters. Cluster sharing leads to a greater memory reduction by reusing the index information. PatterNet factorizes input activations and post-processes the unique weights, which saves multiplications by several orders of magnitude. Furthermore, PatterNet reduces the add operations by exploiting the fact that filters sharing a clustering pattern have the same factorized terms. We introduce techniques for determining and assigning clustering patterns and for training a network to fulfill the target patterns. We also propose and implement an efficient accelerator that builds upon the patterned filters. Experimental results show that PatterNet shrinks the memory and operation count by up to 80.2% and 73.1%, respectively, with similar accuracy to the baseline models. The PatterNet accelerator improves energy efficiency by 107x over an Nvidia GTX 1080 and 2.2x over the state of the art.
{"title":"PatterNet","authors":"Behnam Khaleghi, U. Mallappa, Duygu Yaldiz, Haichao Yang, Monil Shah, Jaeyoung Kang, Tajana Rosing","doi":"10.1145/3489517.3530422","DOIUrl":"https://doi.org/10.1145/3489517.3530422","journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference"},"publicationDate":"2022-07-10"}
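The factorization idea in the PatterNet abstract above can be shown with a generic sketch for a single dot product. With clustered weights, activations that share a centroid can be summed first and multiplied once, and filters that share the same index pattern can reuse those partial sums. This illustrates the general principle under stated assumptions; it is not PatterNet's accelerator dataflow, and the names `dense_dot`, `partial_sums`, and `shared_pattern_dots` are hypothetical.

```python
def dense_dot(centroids, index, x):
    """Baseline: one multiplication per weight, looking each weight up by its cluster index."""
    return sum(centroids[c] * xi for c, xi in zip(index, x))


def partial_sums(num_clusters, index, x):
    """Sum the activations that map to each cluster (additions only)."""
    sums = [0.0] * num_clusters
    for c, xi in zip(index, x):
        sums[c] += xi
    return sums


def shared_pattern_dots(centroid_sets, index, x):
    """Dot products for several filters that share one clustering pattern.

    The partial sums are computed once and reused by every filter, so
    each filter costs only len(centroids) multiplications.
    """
    num_clusters = len(centroid_sets[0])
    sums = partial_sums(num_clusters, index, x)
    return [sum(w * s for w, s in zip(cents, sums)) for cents in centroid_sets]


if __name__ == "__main__":
    index = [0, 2, 1, 0, 2, 1]            # cluster id of each weight position
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]    # input activations
    filters = [[0.5, -1.0, 2.0], [1.0, 0.0, -0.5]]
    print(shared_pattern_dots(filters, index, x))  # [7.5, 1.5]
```

In this example each filter needs three multiplications instead of six, and the additions that build the partial sums are paid once for both filters.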
Zihan Wang, Chengcheng Wan, Yuting Chen, Ziyi Lin, He Jiang, Lei Qiao
Neural architecture search (NAS) is widely used in industry to find neural networks that meet task requirements. However, scheduling the resulting networks under memory constraints remains a challenge. This paper proposes HMCOS, which performs hierarchical memory-constrained operator scheduling of NAS networks: given a network, HMCOS constructs a hierarchical computation graph and employs an iterative scheduling algorithm to progressively reduce the peak memory footprint. We evaluate HMCOS against RPO and Serenity, two popular scheduling techniques. The results show that HMCOS outperforms existing techniques by supporting more NAS networks, reducing peak memory footprints by 8.7% to 42.4%, and achieving 137x to 283x speedups in scheduling.
{"title":"Hierarchical memory-constrained operator scheduling of neural architecture search networks","authors":"Zihan Wang, Chengcheng Wan, Yuting Chen, Ziyi Lin, He Jiang, Lei Qiao","doi":"10.1145/3489517.3530472","DOIUrl":"https://doi.org/10.1145/3489517.3530472","journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference"},"publicationDate":"2022-07-10"}
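Why the operator order affects peak memory, which is the quantity the HMCOS abstract above sets out to reduce, can be seen in a small sketch. It assumes a simple memory model in which an operator's output is allocated while its inputs are still live, and a tensor is freed as soon as its last consumer has run. The function `peak_memory` and the example graph are hypothetical, and the sketch only evaluates a given schedule; it does not implement the HMCOS scheduling algorithm.

```python
def peak_memory(schedule, inputs, size):
    """Peak live memory of a schedule.

    schedule: operator names in execution order (a valid topological order)
    inputs:   operator -> list of operators whose outputs it consumes
    size:     operator -> size of its output tensor
    """
    # Number of consumers of each tensor that have not run yet.
    remaining = {op: 0 for op in schedule}
    for op in schedule:
        for producer in inputs[op]:
            remaining[producer] += 1

    live = 0
    peak = 0
    for op in schedule:
        live += size[op]              # output is allocated while inputs are live
        peak = max(peak, live)
        for producer in inputs[op]:
            remaining[producer] -= 1
            if remaining[producer] == 0:
                live -= size[producer]  # last consumer ran, free the tensor
    return peak


if __name__ == "__main__":
    # Two branches, each with a large intermediate tensor, joined by f.
    inputs = {"a": [], "b": ["a"], "c": ["b"], "d": ["a"], "e": ["d"], "f": ["c", "e"]}
    size = {"a": 1, "b": 8, "c": 1, "d": 8, "e": 1, "f": 1}
    print(peak_memory(["a", "b", "d", "c", "e", "f"], inputs, size))  # 17
    print(peak_memory(["a", "b", "c", "d", "e", "f"], inputs, size))  # 10
```

Finishing one branch before starting the other keeps only one large intermediate live at a time, which is the kind of saving a memory-aware scheduler searches for.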
Stochastic time-to-digital converters (STDCs) are gaining increasing interest in submicron CMOS analog/mixed-signal design for their superior tolerance to nonlinear quantization levels. However, the large number of delay units and time comparators required for conventional STDC operation incurs excessive implementation costs. This paper presents a fully synthesizable STDC architecture based on an integral nonlinearity (INL) scrambling technique, allowing an order-of-magnitude cost reduction. The proposed technique randomizes and averages the STDC INL using a digital-to-time converter. Moreover, we propose an associated design automation flow and demonstrate an STDC design in a 12nm FinFET process. Post-layout simulations show significant linearity and area/power efficiency improvements compared to prior art.
{"title":"A cost-efficient fully synthesizable stochastic time-to-digital converter design based on integral nonlinearity scrambling","authors":"Qiaochu Zhang, Shiyu Su, M. Chen","doi":"10.1145/3489517.3530502","DOIUrl":"https://doi.org/10.1145/3489517.3530502","journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference"},"publicationDate":"2022-07-10"}
Yibin Gu, Yifan Li, Hua Wang, L. Liu, Ke Zhou, Wei Fang, Gang Hu, Jinhu Liu, Zhuo Cheng
File storage systems (FSSs) use multiple caches to accelerate data access. Unfortunately, efficient FSS cache allocation remains extremely difficult. First, existing constructions of the miss ratio curve (MRC), the key input to cache allocation, are limited to LRU. Second, existing techniques are suitable for same-layer caches but not for hierarchical ones. We present a Learned MRC Profiling based Cache Allocation (LPCA) scheme for FSSs. To the best of our knowledge, LPCA is the first to apply machine learning to model the MRC under non-LRU policies. LPCA also explores an optimization target for hierarchical caches, so that it can provide universal and efficient cache allocation for FSSs.
{"title":"LPCA: learned MRC profiling based cache allocation for file storage systems","authors":"Yibin Gu, Yifan Li, Hua Wang, L. Liu, Ke Zhou, Wei Fang, Gang Hu, Jinhu Liu, Zhuo Cheng","doi":"10.1145/3489517.3530662","DOIUrl":"https://doi.org/10.1145/3489517.3530662","journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference"},"publicationDate":"2022-07-10"}
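For context on the miss ratio curve mentioned in the LPCA abstract above, the sketch below builds an exact MRC for LRU from stack distances, the classic construction that the abstract describes existing work as limited to. It is not LPCA's learned model and says nothing about non-LRU policies. The names `lru_stack_distances` and `lru_miss_ratio_curve` are hypothetical, and the list-based stack is chosen for clarity rather than speed.

```python
def lru_stack_distances(trace):
    """Stack distance of each access: 1 means the key was the most recently used.

    Returns None for the first access to a key (a cold miss).
    """
    stack = []        # least recently used first, most recently used last
    distances = []
    for key in trace:
        if key in stack:
            pos = stack.index(key)
            distances.append(len(stack) - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(key)
    return distances


def lru_miss_ratio_curve(trace, max_size):
    """Miss ratio of an LRU cache for every size from 1 to max_size.

    An access hits in a cache of size s exactly when its stack distance
    is at most s, so one pass over the trace covers all sizes.
    """
    distances = lru_stack_distances(trace)
    curve = []
    for cache_size in range(1, max_size + 1):
        misses = sum(1 for d in distances if d is None or d > cache_size)
        curve.append(misses / len(trace))
    return curve


if __name__ == "__main__":
    trace = ["a", "b", "c", "a", "b", "c", "a", "d"]
    print(lru_miss_ratio_curve(trace, 4))  # [1.0, 1.0, 0.5, 0.5]
```

A cache allocator can read such a curve to see where extra capacity stops paying off: in this trace, a third slot halves the miss ratio and a fourth slot gains nothing.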