Haifeng Gu, Mingsong Chen, Tongquan Wei, Li Lei, Fei Xie
Due to the increasing complexity of System-on-Chip (SoC) design, how to ensure that silicon implementations conform to their high-level specifications is becoming a major challenge. To address this problem, we propose a novel specification-driven conformance checking approach that can automatically identify inconsistencies between different levels of designs. By extending SystemRDL specifications, our approach enables the generation of high-level Formal Device Models (FDMs) that specify access behaviors of interface registers triggered by driver requests. Based on the symbolic execution of the generated FDMs with the same driver requests to virtual/silicon devices, our approach can efficiently check whether the designs of an SoC at different levels exhibit unexpected behaviors that are not modeled in the given specification. Experiments on two industrial network adapters demonstrate the effectiveness of our approach in troubleshooting bugs caused by inconsistencies in both virtual and post-silicon prototypes.
{"title":"Specification-Driven Automated Conformance Checking for Virtual Prototype and Post-Silicon Designs","authors":"Haifeng Gu, Mingsong Chen, Tongquan Wei, Li Lei, Fei Xie","doi":"10.1145/3195970.3196119","DOIUrl":"https://doi.org/10.1145/3195970.3196119","url":null,"abstract":"Due to the increasing complexity of System-on-Chip (SoC) design, how to ensure that silicon implementations conform to their high-level specifications is becoming a major challenge. To address this problem, we propose a novel specification-driven conformance checking approach that can automatically identify inconsistencies between different levels of designs. By extending SystemRDL specifications, our approach enables the generation of high-level Formal Device Models (FDMs) that specify access behaviors of interface registers triggered by driver requests. Based on the symbolic execution of the generated FDMs with the same driver requests to virtual/silicon devices, our approach can efficiently check whether the designs of an SoC at different levels exhibit unexpected behaviors that are not modeled in the given specification. Experiments on two industrial network adapters demonstrate the effectiveness of our approach in troubleshooting bugs caused by inconsistencies in both virtual and post-silicon prototypes.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"8 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86590352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Brain-inspired Hyperdimensional (HD) computing emulates cognition tasks by computing with hypervectors rather than traditional numerical values. In HD, an encoder maps inputs to high dimensional vectors (hypervectors) and combines them to generate a model for each existing class. During inference, HD performs the task of reasoning by looking for similarities of the input hypervector and each pre-stored class hypervector However, there is not a unique encoding in HD which can perfectly map inputs to hypervectors. This results in low HD classification accuracy over complex tasks such as speech recognition. In this paper we propose MHD, a multi-encoder hierarchical classifier, which enables HD to take full advantages of multiple encoders without increasing the cost of classification. MHD consists of two HD stages: a main stage and a decider stage. The main stage makes use of multiple classifiers with different encoders to classify a wide range of input data. Each classifier in the main stage can trade between efficiency and accuracy by dynamically varying the hypervectors’ dimensions. The decider stage, located before the main stage, learns the difficulty of the input data and selects an encoder within the main stage that will provide the maximum accuracy, while also maximizing the efficiency of the classification task. We test the accuracy/efficiency of the proposed MHD on speech recognition application. Our evaluation shows that MHD can provide a 6.6× improvement in energy efficiency and a 6.3× speedup, as compared to baseline single level HD.
{"title":"Hierarchical Hyperdimensional Computing for Energy Efficient Classification","authors":"M. Imani, Chenyu Huang, Deqian Kong, T. Simunic","doi":"10.1145/3195970.3196060","DOIUrl":"https://doi.org/10.1145/3195970.3196060","url":null,"abstract":"Brain-inspired Hyperdimensional (HD) computing emulates cognition tasks by computing with hypervectors rather than traditional numerical values. In HD, an encoder maps inputs to high dimensional vectors (hypervectors) and combines them to generate a model for each existing class. During inference, HD performs the task of reasoning by looking for similarities of the input hypervector and each pre-stored class hypervector However, there is not a unique encoding in HD which can perfectly map inputs to hypervectors. This results in low HD classification accuracy over complex tasks such as speech recognition. In this paper we propose MHD, a multi-encoder hierarchical classifier, which enables HD to take full advantages of multiple encoders without increasing the cost of classification. MHD consists of two HD stages: a main stage and a decider stage. The main stage makes use of multiple classifiers with different encoders to classify a wide range of input data. Each classifier in the main stage can trade between efficiency and accuracy by dynamically varying the hypervectors’ dimensions. The decider stage, located before the main stage, learns the difficulty of the input data and selects an encoder within the main stage that will provide the maximum accuracy, while also maximizing the efficiency of the classification task. We test the accuracy/efficiency of the proposed MHD on speech recognition application. Our evaluation shows that MHD can provide a 6.6× improvement in energy efficiency and a 6.3× speedup, as compared to baseline single level HD.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"11 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84357135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shubham Jain, Swagath Venkataramani, V. Srinivasan, Jungwook Choi, P. Chuang, Le Chang
Deep Neural Networks (DNNs) represent the state-of-the-art in many Artificial Intelligence (AI) tasks involving images, videos, text, and natural language. Their ubiquitous adoption is limited by the high computation and storage requirements of DNNs, especially for energy-constrained inference tasks at the edge using wearable and IoT devices. One promising approach to alleviate the computational challenges is implementing DNNs using low-precision fixed point (<16 bits) representation. However, the quantization error inherent in any Fixed Point (FxP) implementation limits the choice of bit-widths to maintain application-level accuracy. Prior efforts recommend increasing the network size and/or re-training the DNN to minimize loss due to quantization, albeit with limited success.Complementary to the above approaches, we present Compensated-DNN, wherein we propose to dynamically compensate the error introduced due to quantization during execution. To this end, we introduce a new fixed-point representation viz. Fixed Point with Error Compensation (FPEC). The bits in FPEC are split between computation bits vs. compensation bits. The computation bits use conventional FxP notation to represent the number at low-precision. On the other hand, the compensation bits (1 or 2 bits at most) explicitly capture an estimate (direction and magnitude) of the quantization error in the representation. For a given word length, since FPEC uses fewer computation bits compared to FxP representation, we achieve a near-quadratic improvement in energy in the multiply-and-accumulate (MAC) operations. The compensation bits are simultaneously used by a low-overhead sparse compensation scheme to estimate the error accrued during MAC operations, which is then added to the MAC output to minimize the impact of quantization. We build compensated-DNNs for 7 popular image recognition benchmarks with 0.05–20.5 million neurons and 0.01–15.5 billion connections. Based on gate-level analysis at 14nm technology, we achieve 2.65 × –4.88 × and 1.13 × –1.7 × improvement in energy compared to 16-bit and 8-bit FxP implementations respectively, while maintaining <0.5% loss in classification accuracy.
{"title":"Compensated-DNN: Energy Efficient Low-Precision Deep Neural Networks by Compensating Quantization Errors","authors":"Shubham Jain, Swagath Venkataramani, V. Srinivasan, Jungwook Choi, P. Chuang, Le Chang","doi":"10.1145/3195970.3196012","DOIUrl":"https://doi.org/10.1145/3195970.3196012","url":null,"abstract":"Deep Neural Networks (DNNs) represent the state-of-the-art in many Artificial Intelligence (AI) tasks involving images, videos, text, and natural language. Their ubiquitous adoption is limited by the high computation and storage requirements of DNNs, especially for energy-constrained inference tasks at the edge using wearable and IoT devices. One promising approach to alleviate the computational challenges is implementing DNNs using low-precision fixed point (<16 bits) representation. However, the quantization error inherent in any Fixed Point (FxP) implementation limits the choice of bit-widths to maintain application-level accuracy. Prior efforts recommend increasing the network size and/or re-training the DNN to minimize loss due to quantization, albeit with limited success.Complementary to the above approaches, we present Compensated-DNN, wherein we propose to dynamically compensate the error introduced due to quantization during execution. To this end, we introduce a new fixed-point representation viz. Fixed Point with Error Compensation (FPEC). The bits in FPEC are split between computation bits vs. compensation bits. The computation bits use conventional FxP notation to represent the number at low-precision. On the other hand, the compensation bits (1 or 2 bits at most) explicitly capture an estimate (direction and magnitude) of the quantization error in the representation. For a given word length, since FPEC uses fewer computation bits compared to FxP representation, we achieve a near-quadratic improvement in energy in the multiply-and-accumulate (MAC) operations. The compensation bits are simultaneously used by a low-overhead sparse compensation scheme to estimate the error accrued during MAC operations, which is then added to the MAC output to minimize the impact of quantization. We build compensated-DNNs for 7 popular image recognition benchmarks with 0.05–20.5 million neurons and 0.01–15.5 billion connections. Based on gate-level analysis at 14nm technology, we achieve 2.65 × –4.88 × and 1.13 × –1.7 × improvement in energy compared to 16-bit and 8-bit FxP implementations respectively, while maintaining <0.5% loss in classification accuracy.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"96 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85301195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
X-values in test output responses corrupt an output response compaction and can cause a fault coverage loss. X-Masking and X-Canceling MISR methods have been suggested to eliminate X-values, however, there are control data volume and test time overhead issues. These issues become significant as the complexity and the density of the circuits increase. This paper proposes a method to eliminate X's by applying a scan slice granularity X-value correlation analysis. The proposed method exploits scan slice correlation analysis, determines unique control data for the scan slice groups sharing the same control data, and applies them for each scan slice. Hence, the volume of control data can be significantly reduced. The simulation results demonstrate that the proposed method achieves greater control data and test time reduction compared to the conventional methods, without loss of fault coverage.
{"title":"Test Cost Reduction for X-Value Elimination By Scan Slice Correlation Analysis","authors":"Hyunsu Chae, Joon-Sung Yang","doi":"10.1145/3195970.3196127","DOIUrl":"https://doi.org/10.1145/3195970.3196127","url":null,"abstract":"X-values in test output responses corrupt an output response compaction and can cause a fault coverage loss. X-Masking and X-Canceling MISR methods have been suggested to eliminate X-values, however, there are control data volume and test time overhead issues. These issues become significant as the complexity and the density of the circuits increase. This paper proposes a method to eliminate X's by applying a scan slice granularity X-value correlation analysis. The proposed method exploits scan slice correlation analysis, determines unique control data for the scan slice groups sharing the same control data, and applies them for each scan slice. Hence, the volume of control data can be significantly reduced. The simulation results demonstrate that the proposed method achieves greater control data and test time reduction compared to the conventional methods, without loss of fault coverage.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"2 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82722701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Pal, Abhishek Sharma, S. Ray, F. M. D. Paula, Shobha Vasudevan
We present a method for selecting trace messages for post-silicon validation of Systems-on-a-Chips (SoCs) with diverse usage scenarios. We model specifications of interacting flows in typical applications. Our method optimizes trace buffer utilization and flow specification coverage. We present debugging and root cause analysis of subtle bugs in the industry scale OpenSPARC T2 processor. We demonstrate that this scale is beyond the capacity of current tracing approaches. We achieve trace buffer utilization of 98.96% with a flow specification coverage of 94.3% (average). We localize bugs to 21.11% (average) of the potential root causes in our large-scale debugging effort.
{"title":"Application Level Hardware Tracing for Scaling Post-Silicon Debug","authors":"D. Pal, Abhishek Sharma, S. Ray, F. M. D. Paula, Shobha Vasudevan","doi":"10.1145/3195970.3195992","DOIUrl":"https://doi.org/10.1145/3195970.3195992","url":null,"abstract":"We present a method for selecting trace messages for post-silicon validation of Systems-on-a-Chips (SoCs) with diverse usage scenarios. We model specifications of interacting flows in typical applications. Our method optimizes trace buffer utilization and flow specification coverage. We present debugging and root cause analysis of subtle bugs in the industry scale OpenSPARC T2 processor. We demonstrate that this scale is beyond the capacity of current tracing approaches. We achieve trace buffer utilization of 98.96% with a flow specification coverage of 94.3% (average). We localize bugs to 21.11% (average) of the potential root causes in our large-scale debugging effort.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"68 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78722216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Zou, Jingwen Leng, Xin He, Yazhou Zu, V. Reddi, Xuan Zhang
Voltage stacking (VS) fundamentally improves power delivery efficiency (PDE) by series-stacking multiple voltage domains to eliminate explicit step-down voltage conversion and reduce energy loss along the power delivery path. However, it suffers from aggravated supply noise, preventing its adoption in mainstream computing systems. In this paper, we investigate a practical approach to enabling efficient and reliable power delivery in voltage-stacked manycore systems that can ensure worst-case supply noise reliability without excessive costly over-design. We start by developing an analytical model to capture the essential noise behaviors in VS. It allows us to identify dominant noise contributor and derive the worst-case conditions. With this in-depth understanding, we propose a hybrid voltage regulation solution to effectively mitigate noise with worst-case guarantees. When evaluated with real-world benchmarks, our solution can achieve 93.8% power delivery efficiency, an improvement of 13.9% over the conventional baseline.
{"title":"Efficient and Reliable Power Delivery in Voltage-Stacked Manycore System with Hybrid Charge-Recycling Regulators","authors":"An Zou, Jingwen Leng, Xin He, Yazhou Zu, V. Reddi, Xuan Zhang","doi":"10.1145/3195970.3196037","DOIUrl":"https://doi.org/10.1145/3195970.3196037","url":null,"abstract":"Voltage stacking (VS) fundamentally improves power delivery efficiency (PDE) by series-stacking multiple voltage domains to eliminate explicit step-down voltage conversion and reduce energy loss along the power delivery path. However, it suffers from aggravated supply noise, preventing its adoption in mainstream computing systems. In this paper, we investigate a practical approach to enabling efficient and reliable power delivery in voltage-stacked manycore systems that can ensure worst-case supply noise reliability without excessive costly over-design. We start by developing an analytical model to capture the essential noise behaviors in VS. It allows us to identify dominant noise contributor and derive the worst-case conditions. With this in-depth understanding, we propose a hybrid voltage regulation solution to effectively mitigate noise with worst-case guarantees. When evaluated with real-world benchmarks, our solution can achieve 93.8% power delivery efficiency, an improvement of 13.9% over the conventional baseline.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"7 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78876127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Kanduri, A. Miele, A. Rahmani, P. Liljeberg, C. Bolchini, N. Dutt
Run-time resource management of heterogeneous multi-core systems is challenging due to i) dynamic workloads, that often result in ii) conflicting knob actuation decisions, which potentially iii) compromise on performance for thermal safety. We present a runtime resource management strategy for performance guarantees under power constraints using functionally approximate kernels that exploit accuracy-performance trade-offs within error resilient applications. Our controller integrates approximation with power knobs - DVFS, CPU quota, task migration - in coordinated manner to make performance-aware decisions on power management under variable workloads. Experimental results on Odroid XU3 show the effectiveness of this strategy in meeting performance requirements without power violations compared to existing solutions.
{"title":"Approximation-Aware Coordinated Power/Performance Management for Heterogeneous Multi-cores","authors":"A. Kanduri, A. Miele, A. Rahmani, P. Liljeberg, C. Bolchini, N. Dutt","doi":"10.1145/3195970.3195994","DOIUrl":"https://doi.org/10.1145/3195970.3195994","url":null,"abstract":"Run-time resource management of heterogeneous multi-core systems is challenging due to i) dynamic workloads, that often result in ii) conflicting knob actuation decisions, which potentially iii) compromise on performance for thermal safety. We present a runtime resource management strategy for performance guarantees under power constraints using functionally approximate kernels that exploit accuracy-performance trade-offs within error resilient applications. Our controller integrates approximation with power knobs - DVFS, CPU quota, task migration - in coordinated manner to make performance-aware decisions on power management under variable workloads. Experimental results on Odroid XU3 show the effectiveness of this strategy in meeting performance requirements without power violations compared to existing solutions.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"12 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84956573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With increasing dimension of variation space and computational intensive circuit simulation, accurate and fast yield estimation of realistic SRAM chip remains a significant and complicated challenge. In this paper, du Experiment results show that the proposed method has an almost constant time complexity as the dimension increases, and gains 6× speedup over the state-of-the- art method in the 485D cases.
{"title":"An Efficient Bayesian Yield Estimation Method for High Dimensional and High Sigma SRAM Circuits","authors":"Jinyuan Zhai, Changhao Yan, Sheng-Guo Wang, Dian Zhou","doi":"10.1145/3195970.3195987","DOIUrl":"https://doi.org/10.1145/3195970.3195987","url":null,"abstract":"With increasing dimension of variation space and computational intensive circuit simulation, accurate and fast yield estimation of realistic SRAM chip remains a significant and complicated challenge. In this paper, du Experiment results show that the proposed method has an almost constant time complexity as the dimension increases, and gains 6× speedup over the state-of-the- art method in the 485D cases.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"1 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85063142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a technique to reduce compulsory misses of packet processing cache (PPC), which largely affects both throughput and energy of core routers. Rather than prefetching data, our technique called response prediction cache (RPC) speculatively stores predicted data into PPC without additional access to the low-throughput and power-consuming memory (i.e., TCAM). RPC predicts the data related to a response flow at the arrival of the corresponding request flow, based on the request-response model of internet communications. RPC can improve the cache miss rate, throughput, and energy-efficiency of PPC systems by 15.3%, 17.9%, and 17.8%, respectively.
{"title":"Data Prediction for Response Flows in Packet Processing Cache","authors":"Hayato Yamaki, H. Nishi, Shinobu Miwa, H. Honda","doi":"10.1145/3195970.3196021","DOIUrl":"https://doi.org/10.1145/3195970.3196021","url":null,"abstract":"We propose a technique to reduce compulsory misses of packet processing cache (PPC), which largely affects both throughput and energy of core routers. Rather than prefetching data, our technique called response prediction cache (RPC) speculatively stores predicted data into PPC without additional access to the low-throughput and power-consuming memory (i.e., TCAM). RPC predicts the data related to a response flow at the arrival of the corresponding request flow, based on the request-response model of internet communications. RPC can improve the cache miss rate, throughput, and energy-efficiency of PPC systems by 15.3%, 17.9%, and 17.8%, respectively.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"13 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83401202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}