Abhinav Goel, Caleb Tung, Nick Eliopoulos, Xiao Hu, G. Thiruvathukal, James C. Davis, Yung-Hsiang Lu
Processing visual data on mobile devices has many applications, e.g., emergency response and tracking. State-of-the-art computer vision techniques rely on large Deep Neural Networks (DNNs) that are usually too power-hungry to be deployed on resource-constrained edge devices. Many techniques improve the efficiency of DNNs by compromising accuracy. However, the accuracy and efficiency of these techniques cannot be adapted for diverse edge applications with different hardware constraints and accuracy requirements. This paper demonstrates that a recent, efficient tree-based DNN architecture, called the hierarchical DNN, can be converted into a Directed Acyclic Graph-based (DAG) architecture to provide tunable accuracy-efficiency tradeoff options. We propose a systematic method that identifies the connections that must be added to convert the tree to a DAG to improve accuracy. We conduct experiments on popular edge devices and show that increasing the connectivity of the DAG improves the accuracy to within 1% of the existing high-accuracy techniques. Our approach requires 93% less memory, 43% less energy, and 49% fewer operations than the high-accuracy techniques, thus providing more accuracy-efficiency configurations.
{"title":"Directed Acyclic Graph-based Neural Networks for Tunable Low-Power Computer Vision","authors":"Abhinav Goel, Caleb Tung, Nick Eliopoulos, Xiao Hu, G. Thiruvathukal, James C. Davis, Yung-Hsiang Lu","doi":"10.1145/3531437.3539723","DOIUrl":"https://doi.org/10.1145/3531437.3539723","url":null,"PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126310513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abinand Nallathambi, Sanchari Sen, A. Raghunathan, N. Chandrachoodan
Spiking Neural Networks (SNNs) have attracted considerable attention due to their suitability for processing temporal input streams, as well as the emergence of highly power-efficient neuromorphic hardware platforms. The computational cost of evaluating an SNN is strongly correlated with the number of timesteps for which it is evaluated. To improve the computational efficiency of SNN evaluation, we propose layerwise disaggregated SNNs (LD-SNNs), wherein the number of timesteps is independently optimized for each layer of the network. In effect, LD-SNNs allow for a better allocation of computational effort across layers in a network, resulting in an improved tradeoff between accuracy and efficiency. We propose a methodology to design optimized LD-SNNs from any given SNN. Across four benchmark networks, LD-SNNs achieve 1.67-3.84x reduction in synaptic updates and 1.2-2.56x reduction in neurons evaluated. These improvements translate to 1.25-3.45x faster inference on four different hardware platforms, including two server-class platforms, a desktop platform, and an edge SoC.
{"title":"Layerwise Disaggregated Evaluation of Spiking Neural Networks","authors":"Abinand Nallathambi, Sanchari Sen, A. Raghunathan, N. Chandrachoodan","doi":"10.1145/3531437.3539708","DOIUrl":"https://doi.org/10.1145/3531437.3539708","url":null,"PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126665581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reena Elangovan, Ashish Ranjan, Niharika Thakuria, S. Gupta, A. Raghunathan
Piezoelectric FETs (PeFETs) are a promising class of ferroelectric devices that use the piezoelectric effect to modulate strain in the channel. They present several desirable properties for on-chip memory, such as non-volatility, high density, and low-power write capability. In this work, we present the first effort to design and evaluate cache architectures using PeFETs. Two key goals in cache design are to maximize capacity and minimize latency. Accordingly, we consider two different variants of PeFET bit-cells: a high-density variant (HD-PeFET) that does not use a separate access transistor, and a high-performance 1T-1PeFET variant (HP-PeFET) that sacrifices density for lower access latency. We note that at the application level, there exists significant heterogeneity in the sensitivity of applications to cache capacity and latency. To enable a better tradeoff between these conflicting design goals, we propose a hybrid PeFET cache comprising both HP-PeFET and HD-PeFET regions at the granularity of cache ways. We make the key observation that frequently reused blocks residing in the HD-PeFET region are detrimental to overall cache performance due to the higher access latency. Hence, we also propose a cache management policy to identify and migrate these blocks from the HD-PeFET region to the HP-PeFET region at runtime. We develop models of HD-PeFET and HP-PeFET caches using the CACTI framework and evaluate their benefits across a suite of PARSEC and SPLASH-2X benchmarks. We demonstrate 1.11x and 4.55x average improvements in performance and energy, respectively, using the proposed hybrid PeFET last-level cache against a baseline with a traditional SRAM cache at iso-area.
{"title":"Energy Efficient Cache Design with Piezoelectric FETs","authors":"Reena Elangovan, Ashish Ranjan, Niharika Thakuria, S. Gupta, A. Raghunathan","doi":"10.1145/3531437.3539727","DOIUrl":"https://doi.org/10.1145/3531437.3539727","url":null,"PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126238772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cheng Chu, Dawen Xu, Ying Wang, Fan Chen
Although systolic accelerators have become the dominant method for executing Deep Neural Networks (DNNs), their performance efficiency (quantified as Energy-Delay Product or EDP) is limited by the capabilities of silicon Field-Effect Transistors (FETs). FETs constructed from Carbon Nanotubes (CNTs) have demonstrated > 10 × EDP benefits. However, the process variations inherent in carbon nanotube FET (CNFET) fabrication compromise the EDP benefits, resulting in > 40% performance degradation. In this work, we study the impact of CNT process variations and present Canopy, a process-variation-aware systolic DNN accelerator that leverages the spatial correlation in CNT variations. Canopy co-optimizes the architecture and dataflow to allow computing engines in a systolic array to run at their best performance with non-uniform latency, minimizing the performance degradation incurred by CNT variations. Furthermore, we devise Canopy with dynamic reconfigurability such that the microarchitectural capability and its associated flexibility achieve an extra degree of adaptability with regard to the DNN topology and processing hyper-parameters (e.g., batch size). Experimental results show that Canopy improves the performance by 5.85 × (4.66 ×) and reduces the energy by 34% (90%) when inferencing a single input (a batch of inputs) compared to the baseline design under an iso-area comparison across seven DNN workloads.
{"title":"Canopy: A CNFET-based Process Variation Aware Systolic DNN Accelerator","authors":"Cheng Chu, Dawen Xu, Ying Wang, Fan Chen","doi":"10.1145/3531437.3539703","DOIUrl":"https://doi.org/10.1145/3531437.3539703","url":null,"PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122665311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deepika Sharma, Aayush Ankit, K. Roy
Deep feed-forward Spiking Neural Networks (SNNs) trained using appropriate learning algorithms have been shown to match the performance of state-of-the-art Artificial Neural Networks (ANNs). The inputs to an SNN layer are 1-bit spikes distributed over several timesteps. In addition to the standard ANN data structures, SNNs require one additional data structure: the membrane potential (Vmem) of each neuron, which is updated every timestep. Hence, the dataflow requirements for energy-efficient hardware implementation of SNNs can differ from those of standard ANNs. In this paper, we propose optimal dataflows for deep spiking neural network layers. To evaluate the energy and latency of different dataflows, we considered three hardware architectures with varying on-chip resources to represent a class of spatial accelerators. We developed a set of rules leading to optimal dataflows for SNNs that achieve more than 90% improvement in Energy-Delay Product (EDP) compared to the baseline for some workloads and architectures.
{"title":"Identifying Efficient Dataflows for Spiking Neural Networks","authors":"Deepika Sharma, Aayush Ankit, K. Roy","doi":"10.1145/3531437.3539704","DOIUrl":"https://doi.org/10.1145/3531437.3539704","url":null,"PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130117219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
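Several abstracts in this listing report results as Energy-Delay Product (EDP), which is the product of the energy consumed and the time taken, so lower is better. The sketch below shows how a relative EDP improvement can be computed. The function names and all numeric values are hypothetical illustrations and are not taken from any of the papers.

```python
def energy_delay_product(energy_joules: float, delay_seconds: float) -> float:
    """EDP = energy x delay; a lower value means a more efficient design."""
    return energy_joules * delay_seconds


def edp_improvement(baseline_edp: float, new_edp: float) -> float:
    """Fractional EDP reduction relative to the baseline (0.9 means 90% lower)."""
    return 1.0 - new_edp / baseline_edp


# Hypothetical numbers, chosen only to illustrate the arithmetic.
baseline = energy_delay_product(energy_joules=10e-6, delay_seconds=4e-3)
candidate = energy_delay_product(energy_joules=2e-6, delay_seconds=1.5e-3)
improvement = edp_improvement(baseline, candidate)

print(f"EDP improvement: {improvement:.1%}")
```

Because EDP multiplies the two quantities, a design that cuts energy by 5x and delay by about 2.7x, as in these made-up numbers, lowers EDP by more than 90%.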
Prattay Chowdhury, Chaitali Sathe, Benjamin Carrion Schaefer
With most VLSI design companies now being fabless, it is imperative to develop methods to protect their Intellectual Property (IP). One approach that has become very popular due to its relative simplicity and practicality is logic locking. One of the problems with traditional locking mechanisms is that the locking circuitry is built into the netlist that the VLSI design company delivers to the foundry, which now has access to the entire design, including the locking mechanism. This implies that the foundry could potentially tamper with this circuitry or reverse engineer it to obtain the locking key. One relatively new approach, coined logic locking through omission or hardware redaction, maps a portion of the design to an embedded FPGA (eFPGA). The bitstream of the eFPGA then acts as the locking key. This new approach has been shown to be more secure, as the foundry has no access to the bitstream during the manufacturing stage. The obvious drawbacks are the increase in design complexity and the area and performance overheads associated with the eFPGA. In this work we propose, to the best of our knowledge, the first attack on this new type of locking mechanism, which substitutes the exact logic mapped onto the eFPGA with a synthesizable predictive model that replicates the behavior of the exact logic. We show that this approach is applicable in the context of approximate computing, where hardware accelerators tolerate a certain degree of error at their outputs. Experimental results show that our proposed approach is very effective at finding suitable predictive models while simultaneously reducing the overall power consumption.
{"title":"Predictive Model Attack for Embedded FPGA Logic Locking","authors":"Prattay Chowdhury, Chaitali Sathe, Benjamin Carrion Schaefer","doi":"10.1145/3531437.3539728","DOIUrl":"https://doi.org/10.1145/3531437.3539728","url":null,"PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126711533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ruicong Chen, H. Kung, A. Chandrakasan, Hae-Seung Lee
In this work, we propose the first bit-level sparsity-aware SAR ADC with direct hybrid encoding for signed expressions (HESE) for AIoT applications. ADCs are typically a bottleneck in reducing the energy consumption of analog neural networks (ANNs). For a pre-trained Convolutional Neural Network (CNN) inference, a HESE SAR for an ANN can reduce the number of non-zero signed digit terms to be output, and thus enables a reduction in energy along with the term quantization (TQ). The proposed SAR ADC directly produces the HESE signed-digit representation (SDR) using two thresholds per cycle for 2-bit look-ahead (LA). A prototype in 65 nm shows that the HESE SAR provides sparsity encoding with a Walden FoM of 15.2 fJ/conv.-step at 45 MS/s. The core area is 0.072 mm².
{"title":"A Bit-level Sparsity-aware SAR ADC with Direct Hybrid Encoding for Signed Expressions for AIoT Applications","authors":"Ruicong Chen, H. Kung, A. Chandrakasan, Hae-Seung Lee","doi":"10.1145/3531437.3539700","DOIUrl":"https://doi.org/10.1145/3531437.3539700","url":null,"PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130940654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
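The Walden figure of merit quoted in the abstract above is conventionally defined as power divided by the product of 2^ENOB and the sampling rate, giving energy per conversion step. The sketch below computes it under that standard definition. The function name and the input values are hypothetical: the abstract does not state the prototype's power or effective number of bits, so the numbers are not the paper's.

```python
def walden_fom(power_watts: float, enob_bits: float, sample_rate_hz: float) -> float:
    """Walden FoM in joules per conversion step: P / (2**ENOB * f_s)."""
    return power_watts / (2 ** enob_bits * sample_rate_hz)


# Hypothetical ADC: 100 uW, 8 effective bits, 10 MS/s (not values from the paper).
fom_joules = walden_fom(power_watts=100e-6, enob_bits=8.0, sample_rate_hz=10e6)

# Report in femtojoules per conversion step, the unit used in the abstract.
print(f"Walden FoM: {fom_joules * 1e15:.2f} fJ/conv.-step")
```

A lower value indicates a more energy-efficient converter at a given resolution and speed.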
Shida Zhang, Nael Mizanur Rahman, Venkata Chaitanya Krishna Chekuri, Carlos Tokunaga, S. Mukhopadhyay
This paper presents a simulation-based study to evaluate the effect of Hot Carrier Injection (HCI) on the characteristics of an on-chip, digitally-controlled, switched inductor voltage regulator (IVR) architecture. Our methodology integrates device-level aging models, circuit simulations in SPICE, and control loop simulations in Simulink. We characterize the effect of HCI on individual components of an IVR, and their combined effect on the efficiency and transient performance. Our analysis using an IVR designed in 65nm CMOS shows that aging of the power stages has a smaller impact on performance compared to that of the control loop. Further, we perform a comparative analysis to show that, with a 1.8V supply, HCI leads to higher aging-induced degradation of IVR than Negative Bias Temperature Instability (NBTI). Finally, our simulation shows that parasitic inductance near IVR input aggravates NBTI and parasitic capacitance near IVR output aggravates HCI effects on IVR’s performance.
{"title":"Analysis of the Effect of Hot Carrier Injection in An Integrated Inductive Voltage Regulator","authors":"Shida Zhang, Nael Mizanur Rahman, Venkata Chaitanya Krishna Chekuri, Carlos Tokunaga, S. Mukhopadhyay","doi":"10.1145/3531437.3539710","DOIUrl":"https://doi.org/10.1145/3531437.3539710","url":null,"PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"129 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131136867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lingjun Zhu, Nesara Eranna Bethur, Yi-Chen Lu, Youngsang Cho, Yunhyeok Im, S. Lim
Micro-bump and hybrid bonding technologies have enabled 3D ICs and provided remarkable performance gains, but the memory macro partitioning problem has also become more complicated due to the limited 3D connection density. In this paper, we evaluate and quantify the impact of various macro partitioning schemes on the performance and temperature of commercial-grade 3D ICs. In addition, we propose a set of partitioning guidelines and a quick constraint-graph-based approach to create floorplans for logic-on-memory 3D ICs. Experimental results show that optimized macro partitioning can improve the performance of logic-on-memory 3D ICs by up to 15%, at the cost of an 8°C temperature increase. Assuming air cooling, our simulation shows that the 3D ICs are thermally sustainable, with a maximum temperature of 97°C.
{"title":"3D IC Tier Partitioning of Memory Macros: PPA vs. Thermal Tradeoffs","authors":"Lingjun Zhu, Nesara Eranna Bethur, Yi-Chen Lu, Youngsang Cho, Yunhyeok Im, S. Lim","doi":"10.1145/3531437.3539724","DOIUrl":"https://doi.org/10.1145/3531437.3539724","url":null,"PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126224855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anthony Agnesina, Moritz Brunion, A. Ortiz, F. Catthoor, D. Milojevic, M. Komalan, Matheus A. Cavalcante, Samuel Riedel, L. Benini, S. Lim
Hierarchical very-large-scale integration (VLSI) flows are an understudied yet critical approach to achieving design closure at giga-scale complexity and gigahertz frequency targets. This paper proposes a novel hierarchical physical design flow enabling the building of high-density and commercial-quality two-tier face-to-face-bonded hierarchical 3D ICs. We significantly reduce the associated manufacturing cost compared to existing 3D implementation flows and, for the first time, achieve cost competitiveness against the 2D reference in large modern designs. Experimental results on complex industrial and open manycore processors demonstrate in two advanced nodes that the proposed flow provides major power, performance, and area/cost (PPAC) improvements of 1.2 to 2.2 × compared with 2D, where all metrics are improved simultaneously, including up to power savings.
{"title":"Hier-3D: A Hierarchical Physical Design Methodology for Face-to-Face-Bonded 3D ICs","authors":"Anthony Agnesina, Moritz Brunion, A. Ortiz, F. Catthoor, D. Milojevic, M. Komalan, Matheus A. Cavalcante, Samuel Riedel, L. Benini, S. Lim","doi":"10.1145/3531437.3539702","DOIUrl":"https://doi.org/10.1145/3531437.3539702","url":null,"PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132752521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}