H. Sayadi, Nisarg Patel, Sai Manoj P D, Avesta Sasan, S. Rafatirad, H. Homayoun
Malware detection at the hardware level has recently emerged as a promising solution to improve the security of computing systems. Hardware-based malware detectors take advantage of Machine Learning (ML) classifiers to detect the patterns of malicious applications at run-time. These ML classifiers are trained on low-level features such as processor Hardware Performance Counter (HPC) data, which are captured at run-time to appropriately represent the application behaviour. Recent studies show the potential of standard ML-based classifiers for detecting malware by analyzing a large number of microarchitectural events, far more than the very limited number of HPC registers available in today’s microprocessors, which varies from 2 to 8. This requires executing the application more than once to collect the required data, which in turn makes the solution less practical for effective run-time malware detection. Our results show a clear trade-off between the performance of standard ML classifiers and the number and diversity of HPCs available in modern microprocessors. This paper proposes a machine-learning-based solution that breaks this trade-off to realize effective run-time detection of malware. We propose ensemble learning techniques to improve the performance of hardware-based malware detectors despite using a very small number of microarchitectural events captured at run-time by existing HPCs, eliminating the need to run an application several times. For this purpose, eight robust machine learning models and two well-known ensemble learning classifiers applied to all studied ML models (sixteen in total) are implemented for malware detection and precisely compared and characterized in terms of detection accuracy, robustness, performance (accuracy × robustness), and hardware overhead. The experimental results show that the proposed ensemble-learning-based malware detection with just 2 HPCs outperforms standard classifiers with 8 HPCs by up to 17%. In addition, it can match the robustness and performance of standard ML-based detectors that use 16 HPCs while using only 4 HPCs, allowing effective run-time detection of malware.
"Ensemble Learning for Effective Run-Time Hardware-Based Malware Detection: A Comprehensive Analysis and Classification," DOI: 10.1145/3195970.3196047, in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), June 2018, pp. 1-6.
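As a rough illustration of the ensemble idea (not the paper's actual models or features), a majority-vote ensemble of weak one-feature classifiers can be sketched over synthetic two-feature "HPC" vectors; everything below (the feature distributions, the stump learner, the vote count) is invented for the sketch:

```python
import random

def stump_predict(stump, x):
    # A stump thresholds a single "HPC" feature; pol selects polarity.
    feat, t, pol = stump
    return pol if x[feat] >= t else 1 - pol

def train_stump(data):
    # Weak learner: pick the (feature, threshold, polarity) with the
    # lowest training error over the labelled (features, label) pairs.
    best = (len(data) + 1, 0, 0.0, 1)
    for feat in (0, 1):
        for x, _ in data:
            for pol in (0, 1):
                err = sum(stump_predict((feat, x[feat], pol), xi) != yi
                          for xi, yi in data)
                if err < best[0]:
                    best = (err, feat, x[feat], pol)
    return best[1:]

def train_ensemble(data, n_learners=9, seed=0):
    # Bagging: each stump trains on a bootstrap resample of the data.
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data])
            for _ in range(n_learners)]

def predict(ensemble, x):
    # Majority vote over the weak learners decides the final label.
    votes = sum(stump_predict(s, x) for s in ensemble)
    return 1 if 2 * votes > len(ensemble) else 0

def make_data(n, seed):
    # Synthetic stand-ins for two HPC readings (e.g. cache misses,
    # branch mispredictions); label 1 marks the "malware" class.
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        data.append(((rng.gauss(100, 30), rng.gauss(50, 20)), 0))
        data.append(((rng.gauss(200, 30), rng.gauss(120, 20)), 1))
    return data
```

Each individual stump sees only one of the two features, mirroring the constraint of very few HPC registers; the vote compensates for the weakness of any single learner.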
Limited processing power and memory prevent realization of state-of-the-art algorithms at the edge. Offloading computations to the cloud comes with trade-offs, as compression techniques employed to conserve transmission bandwidth and energy adversely impact the accuracy of the algorithm. In this paper, we propose collaborative processing that actively guides the output of the sensor to improve performance on the end application. We apply this methodology to smart surveillance, specifically the task of object detection from video. Perceptual quality and object detection performance are characterized and improved under a variety of channel conditions.
"Edge-Cloud Collaborative Processing for Intelligent Internet of Things: A Case Study on Smart Surveillance," DOI: 10.1145/3195970.3196036, in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), June 2018, pp. 1-6.
A. Mishchenko, R. Brayton, A. Petkovska, Mathias Soeken, L. Amarù, A. Domic
A representation of a Boolean function is canonical if, given a variable order, only one instance of the representation is possible for the function. A computation is canonical if the result depends only on the Boolean function and a variable order, and does not depend on how the function is represented or how the computation is implemented. In the context of Boolean satisfiability (SAT), canonicity of the computation implies that the result (a satisfying assignment for satisfiable instances and an abstraction of the unsat core for unsatisfiable instances) does not depend on the functional representation or the SAT solver used. This paper shows that SAT-based computations can be made canonical even though the SAT solver does not use a canonical data structure. This brings advantages in EDA applications, such as irredundant sum-of-products (ISOP) computation and counter-example minimization, where the uniqueness of solutions and/or improved quality of results justify a runtime overhead.
"Canonical Computation without Canonical Representation," DOI: 10.1145/3195970.3196006, in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), June 2018, pp. 1-6.
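The notion of a canonical SAT result can be illustrated with a brute-force toy (a stand-in for the paper's technique, which of course does not enumerate assignments): fixing the variable order and always returning the lexicographically smallest satisfying assignment makes the answer depend only on the function, not on how the CNF is written.

```python
from itertools import product

def satisfies(clauses, assign):
    # CNF clauses as lists of signed 1-indexed literals:
    # 2 means x2 must be True, -3 means x3 must be False.
    return all(any((lit > 0) == assign[abs(lit) - 1] for lit in clause)
               for clause in clauses)

def canonical_sat(clauses, nvars):
    # Canonical: the result depends only on the Boolean function and the
    # fixed variable order 1..nvars, never on clause or literal order.
    for assign in product((False, True), repeat=nvars):
        if satisfies(clauses, assign):
            return assign
    return None  # unsatisfiable
```

Two differently written CNFs for the same function, e.g. (x1 OR x2) AND (NOT x1 OR x3), yield the identical assignment, which is exactly the property the paper establishes for practical solver-based flows.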
Today’s embedded systems operate under increasingly dynamic conditions. First, computational workloads can be either fluctuating or adjustable. Moreover, as many devices are battery-powered, it is common to have a runtime power management technique, which results in a dynamic power budget. This paper presents a design methodology for multi-core systems, based on dataflow specification, that can deal with various contexts. We optimize the original dataflow considering various working conditions, then autonomously adapt it to a pre-defined optimal form in response to context changes. We show the effectiveness of the proposed technique with a real-life case study and synthetic benchmarks.
"Context-Aware Dataflow Adaptation Technique for Low-Power Multi-Core Embedded Systems," DOI: 10.1145/3195970.3196015, in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), June 2018, pp. 1-6.
CPU-FPGA heterogeneous architectures feature flexible acceleration of many workloads to advance computational capabilities and energy efficiency in today’s datacenters. This advantage, however, is often overshadowed by the poor programmability of FPGAs. Although recent advances in high-level synthesis (HLS) significantly improve FPGA programmability, they still leave programmers facing the challenge of identifying the optimal design configuration in a tremendous design space. In this paper, we propose the composable, parallel and pipeline (CPP) microarchitecture as an accelerator design template to substantially reduce the design space. Also, by introducing the CPP analytical model to capture the performance-resource trade-offs, we achieve efficient, analytical-model-based design space exploration. Furthermore, we develop the AutoAccel framework to automate the entire accelerator generation process. Our experiments show that the AutoAccel-generated accelerators outperform their corresponding software implementations by an average of 72× for a broad class of computation kernels.
"Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture," DOI: 10.1145/3195970.3195999, in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), June 2018, pp. 1-6.
Artur Mrowca, Thomas Pramsohler, S. Steinhorst, U. Baumgarten
In modern vehicles, high communication complexity requires cost-effective integration tests such as data-driven system verification with in-vehicle network traces. With the growing volume of traces, distributable Big Data solutions become essential to inspect the massive amounts of trace data. Such traces need to be processed systematically using automated procedures, as manual steps become infeasible due to loading and processing times in existing tools. Further, trace analyses require multiple domains to verify the system in terms of different aspects (e.g., specific functions) and thus require solutions that can be parameterized for the respective domains. Existing solutions are not able to process such trace volumes in a flexible and automated manner. To overcome this, we introduce a fully automated and parallelizable end-to-end preprocessing framework that enables analysis of massive in-vehicle network traces. Parameterized per domain, the trace data is systematically reduced and extended with domain knowledge, yielding a representation targeted towards domain-specific system analyses. We show that our approach outperforms existing solutions in terms of execution time and extensibility by evaluating it on three real-world data sets from the automotive industry.
"Automated Interpretation and Reduction of In-Vehicle Network Traces at a Large Scale," DOI: 10.1145/3195970.3196000, in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), June 2018, pp. 1-6.
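The reduce-then-extend shape of such a domain-parameterized pipeline can be sketched in a few lines (the message fields, domain dictionary, and labels below are hypothetical; the paper's framework is distributed and far richer):

```python
def preprocess(trace, domain):
    """Toy domain-parameterized pipeline: first reduce the trace to the
    signals the domain cares about, then extend each message with
    domain knowledge (a human-readable interpretation)."""
    keep = domain["signals"]
    # Reduction step: drop everything outside the domain's signal set.
    reduced = [msg for msg in trace if msg["signal"] in keep]
    # Extension step: annotate raw values using the domain's knowledge base.
    for msg in reduced:
        msg["meaning"] = domain["labels"].get(
            (msg["signal"], msg["value"]), "unknown")
    return reduced
```

Because each message is handled independently, a pipeline of this shape parallelizes trivially over trace partitions, which is what makes the approach amenable to Big Data execution engines.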
B. W. Ku, Yu Liu, Yingyezhe Jin, S. Samal, Peng Li, S. Lim
A liquid state machine (LSM) is a powerful recurrent spiking neural network shown to be effective in various learning tasks, including speech recognition. In this work, we investigate design and architectural co-optimization to further improve the area-energy efficiency of LSM-based speech recognition processors with monolithic 3D IC (M3D) technology. We conduct fine-grained tier partitioning, where individual neurons are folded, and explore the impact of shared memory architecture and synaptic model complexity on the power-performance-area-accuracy (PPAA) benefit of M3D LSM-based speech recognition. In training and classification tasks using spoken English letters, we obtain up to 70.0% PPAA savings over 2D ICs.
"Design and Architectural Co-optimization of Monolithic 3D Liquid State Machine-based Neuromorphic Processor," DOI: 10.1145/3195970.3196024, in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), June 2018, pp. 1-6.
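For readers unfamiliar with LSMs, the "liquid" is a fixed, randomly wired pool of spiking neurons whose state encodes the input history; only a readout on top is trained. A minimal leaky integrate-and-fire sketch (all sizes, weights, and constants invented; no relation to the hardware in the paper):

```python
import random

def lsm_state(spike_train, n=16, tau=0.9, thresh=1.0, seed=7):
    """Toy leaky integrate-and-fire 'liquid': fixed random input and
    recurrent weights, leaky membrane potentials, spike-and-reset
    dynamics. The final state vector is what a trained readout
    (not shown) would classify."""
    rng = random.Random(seed)
    w_in = [rng.uniform(-1, 1) for _ in range(n)]
    w_rec = [[rng.uniform(-0.5, 0.5) if rng.random() < 0.2 else 0.0
              for _ in range(n)] for _ in range(n)]
    v = [0.0] * n          # membrane potentials
    fired = [0.0] * n      # spikes emitted on the previous step
    for s in spike_train:
        # Leak, inject the input spike, and add recurrent spike currents.
        v = [tau * v[i] + w_in[i] * s
             + sum(w_rec[i][j] * fired[j] for j in range(n))
             for i in range(n)]
        fired = [1.0 if vi >= thresh else 0.0 for vi in v]
        v = [0.0 if f else vi for vi, f in zip(v, fired)]  # reset spikers
    return v
```

Because the recurrent weights are fixed, two input trains with different spike timing leave the liquid in different states, which is the property the readout exploits.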
Intellectual Property (IP) theft costs semiconductor design companies billions of dollars every year. Unauthorized IP copies start from reverse engineering the given chip. Existing techniques to protect against IP theft aim to hide the IC’s functionality, but focus on manipulating the HDL descriptions. We propose TAO as a comprehensive solution based on high-level synthesis to raise the abstraction level and apply algorithmic obfuscation automatically. TAO includes several transformations that make the component hard to reverse engineer during chip fabrication, while a key is later inserted to unlock the correct functionality. Despite the hardware overhead needed to implement the obfuscation, this is a promising approach for obfuscating large-scale designs.
"TAO: Techniques for Algorithm-Level Obfuscation during High-Level Synthesis," DOI: 10.1145/3195970.3196126, in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), June 2018, pp. 1-6.
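The key-unlocks-functionality idea can be shown with a deliberately tiny locking scheme (this is a generic logic-locking toy, not one of TAO's actual transformations; the key value and the MAC datapath are invented): a constant in the datapath is XOR-masked with the key, so only the correct key restores the intended computation.

```python
SECRET_KEY = 0xC3  # hypothetical key, inserted only after fabrication

def locked_mac(a, b, acc, key):
    """Multiply-accumulate with an obfuscated datapath: the operand is
    XOR-masked against the key, so the circuit computes a*b + acc only
    when key == SECRET_KEY; any other key silently corrupts the result."""
    masked = a ^ key ^ SECRET_KEY  # equals a iff the key is correct
    return masked * b + acc
```

To a reverse engineer inspecting the fabricated netlist without the key, the masked operand looks like an ordinary input, which is what makes the functionality hard to recover during fabrication.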
While the Internet of Things (IoT) keeps advancing, its full adoption is continually blocked by power delivery problems. One promising solution is Non-Volatile (NV) processors, which harvest energy for themselves and employ an NV memory hierarchy. This allows them to perform computations when power is available, checkpoint and hibernate when power is scarce, and resume their work at a later time. However, utilizing NV memory creates new security vulnerabilities in the form of wear-out attacks on the register file. This paper explores the dangers of this design oversight and proposes a mitigation strategy that takes advantage of the unique properties and operating characteristics of NV processors. The proposed defense integrates the power management unit with a two-level register rotation approach, which improves NV processor endurance by 30.1× under attack and by an average of 7.1× on standard workloads.
"A Collaborative Defense Against Wear Out Attacks in Non-Volatile Processors," DOI: 10.1145/3195970.3196825, in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), June 2018, pp. 1-6.
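The intuition behind register rotation as a wear-leveling defense can be modeled in a few lines (a single-level toy with invented sizes and rotation period, simpler than the paper's two-level, power-management-integrated scheme): logical registers are remapped to physical NV cells, and the mapping rotates periodically so an attacker hammering one register cannot wear out one cell.

```python
class RotatingRegFile:
    """Toy wear model: logical register index -> physical NV cell,
    with the mapping rotated every `period` writes."""
    def __init__(self, nregs=8, period=4):
        self.nregs, self.period = nregs, period
        self.offset = 0
        self.writes = [0] * nregs  # per-cell wear counters
        self.total = 0

    def phys(self, logical):
        return (logical + self.offset) % self.nregs

    def write(self, logical, value):
        # The value itself is irrelevant to the wear model; only the
        # physical cell absorbing the write matters.
        self.writes[self.phys(logical)] += 1
        self.total += 1
        if self.total % self.period == 0:
            self.offset = (self.offset + 1) % self.nregs  # rotate mapping
```

With rotation, 32 writes hammered at one logical register spread evenly across all 8 cells (4 each); without it, a single cell would absorb all 32 and fail first, which is exactly the wear-out attack.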
Vector-matrix multiplication (VMM) is a core operation in many signal and data processing algorithms. Previous work showed that analog multipliers based on nonvolatile memories have superior energy efficiency compared to their digital counterparts at low-to-medium computing precision. In this paper, we propose an extremely energy-efficient analog-mode VMM circuit with a digital input/output interface and configurable precision. Similar to some previous work, the computation is performed by a gate-coupled circuit utilizing embedded floating-gate (FG) memories. The main novelty of our approach is an ultra-low-power sensing circuitry, designed around a translinear Gilbert cell in topological combination with a floating resistor and a low-gain amplifier. Additionally, the digital-to-analog input conversion is merged with the VMM, while a current-mode algorithmic analog-to-digital circuit is employed at the circuit backend. Such implementations of conversion and sensing allow the circuit to operate entirely in the current domain, resulting in high performance and energy efficiency. For example, post-layout simulation results for a 400 × 400 5-bit VMM circuit designed in a 55 nm process with embedded NOR flash memory show up to 400 MHz operation, 1.68 POps/J energy efficiency, and 39.45 TOps/mm² computing throughput. Moreover, the circuit is robust against process-voltage-temperature variations, in part due to the inclusion of additional FG cells that are utilized for offset compensation.
{"title":"An Ultra-Low Energy Internally Analog, Externally Digital Vector-Matrix Multiplier Based on NOR Flash Memory Technology","authors":"M. Mahmoodi, D. Strukov","doi":"10.1145/3195970.3195989","DOIUrl":"https://doi.org/10.1145/3195970.3195989","url":null,"abstract":"Vector-matrix multiplication (VMM) is a core operation in many signal and data processing algorithms. Previous work showed that analog multipliers based on nonvolatile memories have superior energy efficiency as compared to digital counterparts at low-to-medium computing precision. In this paper, we propose extremely energy efficient analog mode VMM circuit with digital input/output interface and configurable precision. Similar to some previous work, the computation is performed by gate-coupled circuit utilizing embedded floating gate (FG) memories. The main novelty of our approach is an ultra-low power sensing circuitry, which is designed based on translinear Gilbert cell in topological combination with a floating resistor and a low-gain amplifier. Additionally, the digital-to-analog input conversion is merged with VMM, while current-mode algorithmic analog-to-digital circuit is employed at the circuit backend. Such implementations of conversion and sensing allow for circuit operation entirely in a current domain, resulting in high performance and energy efficiency. For example, post-layout simulation results for 400 × 400 5-bit VMM circuit designed in 55 nm process with embedded NOR flash memory, show up to 400 MHz operation, 1.68 POps/J energy efficiency, and 39.45 TOps/mm2 computing throughput. 
Moreover, the circuit is robust against process-voltage-temperature variations, in part due to inclusion of additional FG cells that are utilized for offset compensation.1","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"52 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85722992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
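The internally-analog, externally-digital arrangement can be mimicked numerically (an idealized behavioral model with invented sizes; it deliberately ignores device noise, nonlinearity, and the ADC, which are the paper's actual concerns): digital codes enter, an exact dot product stands in for the analog core, and results come out per column.

```python
def quantize(x, bits=5, xmax=1.0):
    """Map a real input in [0, xmax] to the digital code the
    DAC front-end would see (5-bit by default, as in the paper)."""
    levels = (1 << bits) - 1
    return max(0, min(levels, round(x / xmax * levels)))

def vmm(codes, matrix, bits=5):
    """Digital-in/digital-out VMM. The analog core is modeled as an
    exact dot product over the dequantized inputs; `matrix` is given
    as a list of weight columns."""
    levels = (1 << bits) - 1
    return [sum(c / levels * w for c, w in zip(codes, col))
            for col in matrix]
```

The only precision loss in this idealized model is the 5-bit input quantization, which is why the paper targets low-to-medium-precision workloads where that loss is acceptable in exchange for the energy savings.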