Resistive Memory (ReRAM) is an emerging non-volatile memory technology that has many advantages over conventional DRAM. The ReRAM crossbar has the smallest planar cell size, 4F², and is therefore widely adopted for constructing dense, large-capacity memory. However, ReRAM crossbars suffer from large sneak currents and IR drop. To ensure write reliability, ReRAM write drivers choose larger-than-ideal write voltages, which over-SET/over-RESET many cells at runtime and severely degrade chip lifetime. In this paper, we propose XWL, a novel table-based wear leveling scheme for ReRAM crossbars. We study the correlation between write endurance and voltage stress in ReRAM crossbars. By estimating and tracking the effective write stress on different rows at runtime, XWL selects the most heavily stressed rows for mitigation. Our experimental results show that, on average, XWL improves ReRAM crossbar lifetime by 324% over the baseline, with only 6.1% performance overhead.
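As a rough illustration of the table-based idea in the abstract above, the following Python sketch keeps a per-row stress table and a logical-to-physical row remap. It is not the XWL algorithm itself: the class name `StressTable`, the `stress_weight` argument (a stand-in for the estimated effective voltage stress of one write), the swap threshold, and the swap policy are all assumptions made for illustration, and the cost of migrating data during a swap is ignored.

```python
class StressTable:
    """Toy wear-leveling table: tracks write stress per physical row.

    Hypothetical sketch, not the scheme from the paper.
    """

    def __init__(self, num_rows, swap_threshold):
        self.num_rows = num_rows
        self.swap_threshold = swap_threshold
        self.total = [0.0] * num_rows      # lifetime stress per physical row
        self.recent = [0.0] * num_rows     # stress since the row was last checked
        self.remap = list(range(num_rows)) # logical row -> physical row

    def write(self, logical_row, stress_weight):
        """Record one write and return the physical row that absorbed it."""
        phys = self.remap[logical_row]
        self.total[phys] += stress_weight
        self.recent[phys] += stress_weight
        if self.recent[phys] >= self.swap_threshold:
            # Move the hot logical row onto the least-stressed physical row.
            target = min(range(self.num_rows), key=self.total.__getitem__)
            if target != phys:
                other_logical = self.remap.index(target)
                self.remap[logical_row] = target
                self.remap[other_logical] = phys
                self.recent[target] = 0.0
            self.recent[phys] = 0.0
        return phys
```

Resetting the `recent` counter after each check keeps a cold row that lands on a worn physical row from being swapped straight back; a real design would also have to account for the extra writes that each swap causes.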
{"title":"Wear Leveling for Crossbar Resistive Memory","authors":"Wen Wen, Youtao Zhang, Jun Yang","doi":"10.1145/3195970.3196138","DOIUrl":"https://doi.org/10.1145/3195970.3196138","url":null,"abstract":"Resistive Memory (ReRAM) is an emerging non-volatile memory technology that has many advantages over conventional DRAM. ReRAM crossbar has the smallest 4F2 planar cell size and thus is widely adopted for constructing dense memory with large capacity. However, ReRAM crossbar suffers from large sneaky currents and IR drop. To ensure write reliability, ReRAM write drivers choose larger than ideal write voltages, which over-SET/over-RESET many cells at runtime and lead to severely degraded chip lifetime.In this paper, we propose XWL, a novel table based wear leveling scheme for ReRAM crossbars. We study the correlation between write endurance and voltage stress in ReRAM crossbar. By estimating and tracking the effective write stress to different rows at runtime, XWL chooses the ones that are stressed the most to mitigate. Our experimental results show that, on average, XWL improves the ReRAM crossbar lifetime by 324% over the baseline, with only 6.1% performance overhead.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"13 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73041671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Xygkis, Lazaros Papadopoulos, D. Moloney, D. Soudris, Sofiane Yous
The implementation of Convolutional Neural Networks (CNNs) on edge Internet of Things (IoT) devices is a significant programming challenge, due to the limited computational resources and the real-time requirements of modern applications. This work focuses on the efficient implementation of the Winograd convolution, based on a set of application-independent and Winograd-specific software techniques for improving the utilization of the edge devices' computational resources. The proposed techniques were evaluated on the Intel/Movidius Myriad2 platform, using 4 CNNs with varying computational requirements. The results show significant performance improvements, up to 54%, over other convolution algorithms.
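For readers unfamiliar with Winograd convolution, the sketch below shows the smallest standard case, F(2,3), in one dimension: two outputs of a 3-tap filter are computed with four multiplications instead of six. It illustrates only the general technique, not the Myriad2 kernels or the optimizations evaluated in the paper; the function names are chosen here for illustration.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap filter g over 4 inputs d, using 4 multiplies."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform: in a CNN this is computed once per filter and reused.
    u0 = g0
    u1 = (g0 + g1 + g2) / 2
    u2 = (g0 - g1 + g2) / 2
    u3 = g2
    # Four element-wise multiplies on the transformed input.
    m1 = (d0 - d2) * u0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * u3
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_f23(d, g):
    """Reference result: direct sliding-window evaluation (6 multiplies)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
```

Calling both functions on the same inputs, for example `d = [1, 2, 3, 4]` and `g = [2, 4, 6]`, should give matching results up to floating-point rounding. The 2D variants used in CNN layers nest the same transforms along rows and columns.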
{"title":"Efficient Winograd-based Convolution Kernel Implementation on Edge Devices","authors":"A. Xygkis, Lazaros Papadopoulos, D. Moloney, D. Soudris, Sofiane Yous","doi":"10.1145/3195970.3196041","DOIUrl":"https://doi.org/10.1145/3195970.3196041","url":null,"abstract":"The implementation of Convolutional Neural Networks on edge Internet of Things (IoT) devices is a significant programming challenge, due to the limited computational resources and the real-time requirements of modern applications. This work focuses on the efficient implementation of the Winograd convolution, based on a set of application-independent and Winograd-specific software techniques for improving the utilization of the edge devices computational resources. The proposed techniques were evaluated in Intel/Movidius Myriad2 platform, using 4 CNNs of various computational requirements. The results show significant performance improvements, up to 54%, over other convolution algorithms.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"3 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78854206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accelerator-rich architectures employ IOMMUs to support a unified virtual address space, but research shows that IOMMUs fail to meet the performance and energy requirements of accelerators. Instead of optimizing the speed or energy of IOMMU address translation, this work tackles the issue from a new perspective: it eliminates the need for translation with an active forwarding (AF) mechanism that forwards accelerator input data directly from the CPU cache to the accelerator's scratchpad memory. Results show that, on average, AF provides an 8% performance improvement over the state-of-the-art mechanism, hostPageWalk, and reduces accelerator power by 22.1%.
{"title":"Active Forwarding: Eliminate IOMMU Address Translation for Accelerator-rich Architectures","authors":"H. Fu, Po-Han Wang, Chia-Lin Yang","doi":"10.1145/3195970.3195984","DOIUrl":"https://doi.org/10.1145/3195970.3195984","url":null,"abstract":"Accelerator-rich architectures employ IOMMUs to support unified virtual address, but researches show that they fail to meet the performance and energy requirements of accelerators. Instead of optimizing the speed/energy of IOMMU address translation, this work tackles the issue from a new perspective, eliminating the need for translation with an active forwarding (AF) mechanism that forwards input data of accelerators directly from the CPU cache to the scratchpad memory of the accelerator. Results show that on average, AF can provide 8% performance improvement compared to the state-of-the-art mechanism, hostPageWalk, and reduce 22.1% accelerator power.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"43 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78994053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Ghodrati, Bilgiday Yuce, S. Gujar, Chinmay Deshpande, L. Nazhandali, P. Schaumont
Electromagnetic fault injection (EMFI) is an efficient class of physical attacks that can compromise the immunity of secure cryptographic algorithms. Despite successful EMFI attacks, the effects of electromagnetic (EM) injection on a processor are not well understood. This paper presents a bottom-up analysis of EMFI effects on a RISC microprocessor. We study these effects at three levels: the wire level, the chip-network level, and the gate level, while considering parameters such as EM-injection location and timing. We conclude that EMFI induces local timing errors, implying that current timing attack detection and prevention techniques can be adapted to overcome EMFI.
{"title":"Inducing Local Timing Fault Through EM Injection","authors":"M. Ghodrati, Bilgiday Yuce, S. Gujar, Chinmay Deshpande, L. Nazhandali, P. Schaumont","doi":"10.1145/3195970.3196064","DOIUrl":"https://doi.org/10.1145/3195970.3196064","url":null,"abstract":"Electromagnetic fault injection (EMFI) is an efficient class of physical attacks that can compromise the immunity of secure cryptographic algorithms. Despite successful EMFI attacks, the effects of electromagnetic injection (EM) on a processor are not well understood. This paper presents a bottom-up analysis of EMFI effects on a RISC microprocessor. We study these effects at three levels: at the wire-level, at the chip-network level, and at the gate-level considering parameters such as EM-injection location and timing. We conclude that EMFI induces local timing errors implying current timing attack detection and prevention techniques can be adapted to overcome EMFI.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"17 6 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78493591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ReRAM-based systems are attractive implementation alternatives for neuromorphic computing because of their high speed and low design cost. In this work, we investigate the impact of temperature on ReRAM-based neuromorphic architectures and show how varying temperatures degrade computation accuracy. We first classify ReRAM crossbar cells based on their temperature and identify effective neural network weights, that is, weights that have a large impact on network outputs. Then, we propose a novel temperature-aware training and mapping scheme that prevents the effective weights from being mapped to hot cells, restoring system accuracy. Evaluation results for a two-layer neural network show that our scheme can improve system accuracy by up to 39.2%.
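The mapping step described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' scheme: importance is approximated by the sum of absolute weights in a row, temperature is given per crossbar row rather than per cell, and the function name `thermal_aware_mapping` is hypothetical. The training part of the scheme is not modeled.

```python
def thermal_aware_mapping(weight_rows, row_temps):
    """Return mapping[i] = crossbar row assigned to weight row i.

    The most important weight rows are placed on the coolest crossbar rows.
    Importance metric (sum of |w|) is an assumption made for illustration.
    """
    if len(weight_rows) != len(row_temps):
        raise ValueError("need one crossbar row per weight row")
    importance = [sum(abs(w) for w in row) for row in weight_rows]
    # Weight rows from most to least important.
    by_importance = sorted(range(len(weight_rows)), key=lambda i: -importance[i])
    # Crossbar rows from coolest to hottest.
    by_temp = sorted(range(len(row_temps)), key=lambda r: row_temps[r])
    mapping = [None] * len(weight_rows)
    for weight_idx, crossbar_idx in zip(by_importance, by_temp):
        mapping[weight_idx] = crossbar_idx
    return mapping
```

A greedy pairing like this ignores routing and peripheral constraints that a real crossbar mapping would have to respect.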
{"title":"Thermal-aware Optimizations of ReRAM-based Neuromorphic Computing Systems","authors":"Majed Valad Beigi, G. Memik","doi":"10.1145/3195970.3196128","DOIUrl":"https://doi.org/10.1145/3195970.3196128","url":null,"abstract":"ReRAM-based systems are attractive implementation alternatives for neuromorphic computing because of their high speed and low design cost. In this work, we investigate the impact of temperature on the ReRAM-based neuromorphic architectures and show how varying temperatures have a negative impact on the computation accuracy. We first classify ReRAM crossbar cells based on their temperature and identify effective neural network weights that have large impacts on network outputs. Then, we propose a novel temperature-aware training and mapping scheme to prevent the effective weights from being mapped to hot cells to restore the system accuracy. Evaluation results for a two-layer neural network show that our scheme can improve the system accuracy by up to 39.2%.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"84 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82104622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jungmin Park, Xiaolin Xu, Yier Jin, Domenic Forte, M. Tehranipoor
Modern embedded computing devices are vulnerable to malware and software piracy due to insufficient security scrutiny and the complications of continuous patching. To detect malicious activity and protect the integrity of executable software, it is necessary to monitor the operation of such devices. In this paper, we propose a disassembler based on the power side channel that analyzes the real-time operation of embedded systems at instruction-level granularity. The proposed disassembler obtains templates from an original device (e.g., an IoT home security system or a smart thermostat) and uses machine learning algorithms to uniquely identify instructions executed on the device. Feature selection using Kullback-Leibler (KL) divergence and dimensionality reduction using PCA in the time-frequency domain are proposed to increase identification accuracy. Moreover, a hierarchical classification framework is proposed to reduce the computational complexity associated with large instruction sets. In addition, covariate shifts caused by different measurement environments and device-to-device variations are minimized by our covariate shift adaptation technique. We implement this disassembler on an AVR 8-bit microcontroller. Experimental results demonstrate that our proposed disassembler can recognize test instructions, including register names, with a success rate no lower than 99.03% using quadratic discriminant analysis (QDA).
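To make the KL-divergence feature selection step more concrete, the sketch below ranks the sample points of power traces by how well they separate two instruction classes. It assumes each sample point is roughly Gaussian within a class and uses the closed-form KL divergence between univariate Gaussians; the paper's actual procedure (which works in the time-frequency domain and is followed by PCA and QDA) may differ. The function names and the small variance floor are assumptions.

```python
import math
from statistics import fmean, pvariance


def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for univariate Gaussians P and Q."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0)


def select_features(traces_a, traces_b, k):
    """Return indices of the k sample points with the largest symmetric KL.

    traces_a and traces_b are lists of equal-length traces, one list per
    instruction class (hypothetical input layout).
    """
    scores = []
    for idx in range(len(traces_a[0])):
        col_a = [trace[idx] for trace in traces_a]
        col_b = [trace[idx] for trace in traces_b]
        mu_a, mu_b = fmean(col_a), fmean(col_b)
        # Small floor avoids division by zero for constant columns.
        var_a = pvariance(col_a, mu_a) + 1e-12
        var_b = pvariance(col_b, mu_b) + 1e-12
        score = (gaussian_kl(mu_a, var_a, mu_b, var_b)
                 + gaussian_kl(mu_b, var_b, mu_a, var_a))
        scores.append((score, idx))
    scores.sort(reverse=True)
    return sorted(idx for _, idx in scores[:k])
```

With more than two instruction classes, one option is to sum or average the pairwise scores per sample point before ranking.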
{"title":"Power-based Side-Channel Instruction-level Disassembler","authors":"Jungmin Park, Xiaolin Xu, Yier Jin, Domenic Forte, M. Tehranipoor","doi":"10.1145/3195970.3196094","DOIUrl":"https://doi.org/10.1145/3195970.3196094","url":null,"abstract":"Modern embedded computing devices are vulnerable against mal-ware and software piracy due to insufficient security scrutiny and the complications of continuous patching. To detect malicious activity as well as protecting the integrity of executable software, it is necessary to monitor the operation of such devices. In this paper, we propose a disassembler based on power-based side-channel to analyze the real-time operation of embedded systems at instruction-level granularity. The proposed disassembler obtains templates from an original device (e.g., IoT home security system, smart thermostat, etc.) and utilizes machine learning algorithms to uniquely identify instructions executed on the device. The feature selection using Kullback-Leibler (KL) divergence and the dimensional reduction using PCA in the time-frequency domain are proposed to increase the identification accuracy. Moreover, a hierarchical classification framework is proposed to reduce the computational complexity associated with large instruction sets. In addition, covariate shifts caused by different environmental measurements and device-to-device variations are minimized by our covariate shift adaptation technique. We implement this disassembler on an AVR 8-bit microcontroller. Experimental results demonstrate that our proposed disassembler can recognize test instructions including register names with a success rate no lower than 99.03% with quadratic discriminant analysis (QDA).","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"7 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82251966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Non-volatile memory express (NVMe) over peripheral component interconnect express (PCIe) has been adopted in storage systems to provide low latency and high throughput. NVMe allows a host system to reduce latency because it offers highly parallel operation and an optimized command processing flow. In addition, the introduction of emerging non-volatile memory (NVM) significantly reduces solid state drive (SSD) latency. With latency reduced in both the host system and the SSD, PCIe fabric latency accounts for a considerably larger share of total I/O latency. Therefore, this paper proposes a novel I/O optimization method using a PCIe feature, the virtual channel. Unlike conventional approaches, which use a data path with a single priority, an emerging NVM-based NVMe SSD with the proposed architecture selects a prioritized virtual channel based on the SSD's internal latency to provide deterministic I/O latency. Experimental results show that the proposed method with a phase-change memory (PCM) SSD improves I/O determinism by processing 45% to 74% more commands within the predictable I/O latency than a conventional PCM SSD.
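The selection idea in the abstract above can be illustrated with a very small Python sketch. Everything specific here is an assumption: the abstract does not give the decision rule, so the slack-based comparison, the channel labels `VC0` and `VC1`, and the parameter names are hypothetical and serve only to show how internal latency could drive the choice of virtual channel.

```python
DEFAULT_VC = "VC0"   # hypothetical label for the normal-priority channel
PRIORITY_VC = "VC1"  # hypothetical label for the prioritized channel


def select_virtual_channel(internal_latency_us, fabric_latency_us, target_latency_us):
    """Pick a virtual channel for a command's completion traffic.

    internal_latency_us: time the command already spent inside the SSD.
    fabric_latency_us:   expected PCIe latency on the default channel.
    target_latency_us:   latency bound the command should stay within.
    """
    slack = target_latency_us - internal_latency_us
    # Commands with too little slack left for the default path are prioritized.
    return PRIORITY_VC if slack < fabric_latency_us else DEFAULT_VC
```

Under this rule, commands that were slow inside the SSD are the ones moved to the prioritized channel, which narrows the spread of end-to-end latencies.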
{"title":"Optimized I/O Determinism for Emerging NVM-based NVMe SSD in an Enterprise System","authors":"Seonbong Kim, Joon-Sung Yang","doi":"10.1145/3195970.3196085","DOIUrl":"https://doi.org/10.1145/3195970.3196085","url":null,"abstract":"Non-volatile memory express (NVMe) over peripheral component interconnect express (PCIe) has been adopted in the storage system to provide low latency and high throughput. NVMe allows a host system to reduce latency because it offers a high parallel operation and optimized command processing flow. In addition, an introduction of emerging non-volatile memory (NVM) significantly reduces the solid state drive (SSD) latency. The latency reduction in the host system and SSD makes a relative ratio of PCIe fabric latency to total I/O latency considerably grow. Therefore, this paper proposes a novel I/O optimization method using the PCIe feature, virtual channel. Unlike conventional approaches with the same priority data path, based on SSD's internal latency, an emerging NVM-based NVMe SSD with the proposed architecture selects a prioritized virtual channel to provide deterministic I/O latency. Experimental results show that the proposed method with phase-change memory (PCM) SSD improves I/O determinism by processing 45~74% more commands within the predictable I/O latency than a conventional PCM SSD.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"123 1 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82877180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}