An Ultra-efficient Look-up Table based Programmable Processing in Memory Architecture for Data Encryption
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00049
Purab Ranjan Sutradhar, K. Basu, Sai Manoj Pudukotai Dinakarrao, A. Ganguly
Processing in Memory (PIM), a non-von Neumann computing paradigm, has emerged as a faster and more efficient alternative to traditional computing devices for data-centric applications such as data encryption. In this work, we present a novel PIM architecture implemented with programmable lookup tables (LUTs) inside a DRAM chip to facilitate massively parallel and ultra-efficient data encryption with the Advanced Encryption Standard (AES) algorithm. The LUT-based architecture replaces logic-based computation with LUT ‘look-ups’ to minimize power consumption and operational latency. The proposed PIM architecture is organized as clusters of homogeneous, interconnected LUTs that can be dynamically programmed to execute the operations required for AES encryption. Our simulations show that the proposed PIM architecture offers up to 14.6× and 1.8× higher performance than CUDA-based implementations of AES encryption on a high-end commodity GPU and a state-of-the-art GPU computing processor, respectively, while achieving 217× and 31.2× higher energy efficiency than these devices.
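The enabling idea, replacing per-byte logic with precomputed ‘look-ups’, is easy to illustrate in software. The following Python sketch (our illustration, not the authors' architecture) computes the GF(2^8) multiplications of one AES MixColumns column entirely from 256-entry tables plus XORs, the style of byte-wise operation a dynamically programmed LUT cluster could execute in memory.

```python
def _xtime(a):
    """Multiply by x in GF(2^8), reducing by the AES polynomial 0x11B."""
    a <<= 1
    return (a ^ 0x1B) & 0xFF if a & 0x100 else a

def _gf_mul(a, b):
    """Bitwise GF(2^8) multiplication, used only to build the tables."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        a, b = _xtime(a), b >> 1
    return p

# "Program" the LUTs once: 256-entry tables for multiplication by 2 and 3.
MUL2 = [_gf_mul(x, 2) for x in range(256)]
MUL3 = [_gf_mul(x, 3) for x in range(256)]

def mix_column(col):
    """One AES MixColumns column computed via table look-ups and XORs only."""
    a0, a1, a2, a3 = col
    return [
        MUL2[a0] ^ MUL3[a1] ^ a2 ^ a3,
        a0 ^ MUL2[a1] ^ MUL3[a2] ^ a3,
        a0 ^ a1 ^ MUL2[a2] ^ MUL3[a3],
        MUL3[a0] ^ a1 ^ a2 ^ MUL2[a3],
    ]

# FIPS-197 test vector: expect ['0x8e', '0x4d', '0xa1', '0xbc']
print([hex(v) for v in mix_column([0xDB, 0x13, 0x53, 0x45])])
```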
{"title":"An Ultra-efficient Look-up Table based Programmable Processing in Memory Architecture for Data Encryption","authors":"Purab Ranjan Sutradhar, K. Basu, Sai Manoj Pudukotai Dinakarrao, A. Ganguly","doi":"10.1109/ICCD53106.2021.00049","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00049","url":null,"abstract":"Processing in Memory (PIM), a non-von Neumann computing paradigm, has emerged as a faster and more efficient alternative to the traditional computing devices for data-centric applications such as Data Encryption. In this work, we present a novel PIM architecture implemented using programmable Lookup Tables (LUT) inside a DRAM chip to facilitate massively parallel and ultra-efficient data encryption with the Advanced Encryption Standard (AES) algorithm. Its LUT-based architecture replaces logic-based computations with LUT ‘look-ups’ to minimize power consumption and operational latency. The proposed PIM architecture is organized as clusters of homogeneous, interconnected LUTs that can be dynamically programmed to execute operations required for performing AES encryption. Our simulations show that the proposed PIM architecture can offer up to 14.6× and 1.8× higher performance compared to CUDA-based implementation of AES Encryption on a high-end commodity GPU and a state-of-the-art GPU Computing Processor, respectively. At the same time, it also achieves 217× and 31.2× higher energy efficiency, respectively, than the aforementioned devices while performing AES Encryption.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130889276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SimiEncode: A Similarity-based Encoding Scheme to Improve Performance and Lifetime of Non-Volatile Main Memory
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00044
Suzhen Wu, Jiapeng Wu, Zhirong Shen, Zhihao Zhang, Zuocheng Wang, Bo Mao
Non-Volatile Memories (NVMs) have shown tremendous potential to become the next generation of main memory, yet they are still seriously hampered by high write latency and limited endurance. In this paper, we first unveil, via real-world benchmark analysis, that the words within the same cache line exhibit a high degree of similarity. We therefore present SimiEncode, a low-overhead and effective similarity-based encoding approach. SimiEncode reduces writes to NVMs by (1) generating a mask word with minimized differences to the words within a cache line, (2) encoding each word with the associated mask word through simple XOR operations, and (3) writing a single tag bit to indicate a resulting zero word after encoding. Our prototype implementation of SimiEncode and extensive evaluations driven by 15 state-of-the-art benchmarks demonstrate that, compared with existing approaches, SimiEncode significantly prolongs lifetime and improves performance. Importantly, SimiEncode is orthogonal to, and can be easily incorporated into, existing bit-flipping optimizations.
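A minimal software sketch of the three encoding steps follows. It assumes, for simplicity, that the mask word is chosen as the most frequent word in the line; the paper's mask generation minimizes bit-level differences and may use a different selection rule.

```python
from collections import Counter

def simiencode_line(words):
    mask = Counter(words).most_common(1)[0][0]   # step 1: pick a mask word
    encoded = [w ^ mask for w in words]          # step 2: XOR-encode each word
    zero_tags = [int(e == 0) for e in encoded]   # step 3: tag all-zero results
    return mask, encoded, zero_tags

def simidecode_line(mask, encoded, zero_tags):
    # A tagged word decodes to the mask itself; others XOR back to the original.
    return [mask if t else e ^ mask for e, t in zip(encoded, zero_tags)]

line = [0xDEADBEEF, 0xDEADBEEF, 0xDEADBEEA, 0x00000001]
m, enc, tags = simiencode_line(line)
assert simidecode_line(m, enc, tags) == line     # lossless round trip
```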
{"title":"SimiEncode: A Similarity-based Encoding Scheme to Improve Performance and Lifetime of Non-Volatile Main Memory","authors":"Suzhen Wu, Jiapeng Wu, Zhirong Shen, Zhihao Zhang, Zuocheng Wang, Bo Mao","doi":"10.1109/ICCD53106.2021.00044","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00044","url":null,"abstract":"Non-Volatile Memories (NVMs) have shown tremendous potential to be the next generation of main memory, yet they are still seriously hampered by the high write latency and limited endurance. In this paper, we first unveil via realworld benchmark analysis that the words within the same cache line showcase a high degree of similarity. We therefore present SimiEncode, a low-overhead and effective Similarity-based Encoding approach. SimiEncode relieves writes to NVMs by (1) generating a mask word with minimized differences to the words within a cache line, (2) encoding each word with the associated mask word by simple XOR operations, and (3) writing a single tag bit to indicate the resulting zero word after encoding. Our prototype implementation of SimiEncode and extensive evaluations driven by 15 state-of-the-art benchmarks demonstrate that, compared with existing approaches, SimiEncode significantly prolongs the lifetime and improves the performance. Importantly, SimiEncode is orthogonal to and can be easily incorporated into existing bit flipping optimizations.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134007345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reconfigurable Array for Analog Applications
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00064
Ziyi Chen, I. Savidis
In this paper, a novel field-programmable analog array (FPAA) fabric consisting of a 6×6 matrix of configurable analog blocks (CABs) is proposed. The implementation of programmable CABs eliminates the use of fixed analog sub-circuits. A unique routing strategy developed within the CAB units supports both differential and single-ended circuit configurations. The bandwidth limitation due to the routing switches of each individual CAB unit is compensated for by a switch-less routing network between CABs. Algorithms and methodologies are developed to facilitate rapid implementation of analog circuits on the FPAA. The proposed FPAA fabric provides higher operating speeds than existing FPAA topologies, while offering greater configurability in the CAB units than switch-less FPAAs. The FPAA core includes 498 programming switches and 14 global switch-less interconnects, while occupying an area of 0.1 mm² in a 65 nm CMOS process. The characteristic power consumption is approximately 24.6 mW at a supply voltage of 1.2 V. Circuits implemented on the proposed FPAA fabric include operational amplifiers (op amps), filters, oscillators, and frequency dividers. The reconfigured bandpass filter provides a center frequency of approximately 1.5 GHz, while the synthesized ring oscillator and frequency divider support operating frequencies of up to 500 MHz.
{"title":"Reconfigurable Array for Analog Applications","authors":"Ziyi Chen, I. Savidis","doi":"10.1109/ICCD53106.2021.00064","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00064","url":null,"abstract":"In this paper, a novel field-programmable analog array (FPAA) fabric consisting of a 6x6 matrix of configurable analog blocks (CABs) is proposed. The implementation of programmable CABs eliminates the use of fixed analog sub-circuits. A unique routing strategy is developed within the CAB units that supports both differential and single-ended mode circuit configurations. The bandwidth limitation due to the routing switches of each individual CAB unit is compensated for through the use of a switch-less routing network between CABs. Algorithms and methodologies are developed to facilitate rapid implementation of analog circuits on the FPAA. The proposed FPAA fabric provides high operating speeds as compared to existing FPAA topologies, while providing greater configuration in the CAB units as compared to switch-less FPAAs. The FPAA core includes 498 programming switches and 14 global switchless interconnects, while occupying an area of 0.1 mm2 in a 65 nm CMOS process. The characteristic power consumption is approximately 24.6 mW for a supply voltage of 1.2 V. Circuits implemented on the proposed FPAA fabric include operational amplifiers (op amps), filters, oscillators, and frequency dividers. The reconfigured bandpass filter provides a center frequency of approximately 1.5 GHz, while the synthesized ring-oscillator and frequency divider support operating frequencies of up to 500 MHz.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131981452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RDP3: Rapid Domain Platform Performance Prediction for Design Space Exploration
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00086
Jinghan Zhang, Mehrshad Zandigohar, G. Schirner
Heterogeneous Accelerator-rich (ACC-rich) platforms combining general-purpose cores and specialized hardware accelerators (ACCs) promise high-performance, low-power deployment of streaming applications, e.g., video analytics, software-defined radio, and radar. To recover Non-Recurring Engineering (NRE) cost, a unified domain platform can be exploited for a set of applications, especially when the applications have functional and structural similarities that can benefit from common ACCs. However, identifying the most beneficial set of common ACCs is challenging, and current Design Space Exploration (DSE) methods for domain platform allocation suffer from a long exploration-time bottleneck. In particular, compared to traditional DSE, evaluating the performance of a platform for a domain of applications is much more time-consuming, as binding exploration and evaluation are required for each application in the domain. Thus, rapid domain performance evaluation is needed to speed up the exploration of platform allocations. This paper introduces Rapid Domain Platform Performance Prediction (RDP3) methods to speed up exploration in domain DSE. The key contributions are: (1) analyzing the current domain DSE flow and its exploration-time bottleneck; (2) introducing four RDP3 methods to speed up the evaluation of different platform allocations: Heuristic Processing (HP) estimation together with Linear Regression (LR), Decision Tree Regression (DTR), and Multi-Layer Perceptron (MLP) predictions; (3) comparing these predictors and integrating the prediction into the current domain DSE. To evaluate the efficacy of RDP3, we explore 10K platforms capable of processing OpenVX domain applications. We demonstrate that RDP3-MLP, the most promising method, achieves a 17.5K× speedup with a mean squared error of only 0.001 compared to the current platform evaluation using the analytical model. Integrating RDP3-MLP into the existing domain DSE method GIDE [1] saves 80.8% of exploration time while still producing the same output platform design.
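As a rough illustration of the RDP3-MLP idea, the sketch below trains a small MLP to map a platform allocation to a predicted performance score, standing in for the slow analytical-model evaluation. The feature encoding (a fixed-length vector of ACC counts) and the synthetic training data are our assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Hypothetical encoding: each platform is a count vector over 8 ACC types.
X = rng.integers(0, 4, size=(2000, 8)).astype(float)
# Stand-in for slow analytical-model scores of each allocation.
y = X @ rng.random(8) + 0.1 * rng.standard_normal(2000)

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
model.fit(X[:1500], y[:1500])                       # train on evaluated platforms
pred = model.predict(X[1500:])                      # near-instant prediction for the rest
print("MSE:", mean_squared_error(y[1500:], pred))
```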
{"title":"RDP3: Rapid Domain Platform Performance Prediction for Design Space Exploration","authors":"Jinghan Zhang, Mehrshad Zandigohar, G. Schirner","doi":"10.1109/ICCD53106.2021.00086","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00086","url":null,"abstract":"Heterogeneous Accelerator-rich (ACC-rich) platforms combining general-purpose cores and specialized HW Accelerators (ACCs) promise high-performance and low-power deployment of streaming applications, e.g. for video analytics, software-defined radio, and radar. In order to recover Non-Recurring Engineering (NRE) cost, a unified domain platform for a set of applications can be exploited, especially when applications have functional and structural similarities, which can benefit from common ACCs. However, identifying the most beneficial set of common ACCs is challenging, and current Design Space Exploration (DSE) methods for domain platform allocation suffer from a long exploration time bottleneck. In particular, compared to a traditional DSE, evaluating the performance of a platform for a domain of applications is much more time-consuming as binding exploration and evaluation for each application in the domain is required. Thus, a rapid domain performance evaluation is needed to speed up the exploration of the platform allocation.This paper introduces Rapid Domain Platform Performance Prediction (RDP3) methods to speed up the exploration in domain DSE. Key contributions are: (1) analyzing current domain DSE flow and its exploration time bottleneck; (2) introducing four RDP3 methods to speedup the evaluation of different platform allocations: Heuristic Processing (HP) estimation, Linear Regression (LR), Decision Tree Regression (DTR), and Multi-Layer Perceptron (MLP) predictions; (3) comparing the performance of these predictions and integrating the prediction into the current domain DSE. To evaluate the efficacy of RDP3, we explore 10K platforms capable of processing OpenVX domain applications. We demonstrate that RDP3-MLP as the most promising method can achieve a speedup of 17.5K times with only 0.001 mean square error compared to the current platform evaluation using the analytical model. Integrating RDP3-MLP into the existing domain DSE method GIDE [1] can save 80.8% exploration time while still resulting in the same output platform design.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132248892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamic File Cache Optimization for Hybrid SSDs with High-Density and Low-Cost Flash Memory
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00036
Ben Gu, Longfei Luo, Yina Lv, Changlong Li, Liang Shi
Over the last few years, hybrid solid-state drives (SSDs) have been widely adopted due to their high performance and high capacity. Devices equipped with hybrid SSDs can cache files from the network to improve performance. However, this paper makes an interesting observation: the efficiency of hybrid SSDs is significantly degraded, rather than improved, when too much data is cached. This is because the internal mode switching between different types of flash memory is affected by device utilization. This paper proposes DFCache, a dynamic file cache optimization scheme for hybrid SSDs that optimizes the device's efficiency and limits unreasonable space consumption. DFCache rests on two key ideas: dynamic cache space management and intelligent cache file sifting. DFCache is implemented in the Linux kernel and tested on real hybrid SSDs. Experimental results show that its I/O performance outperforms the state of the art by up to 3.7×.
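To make the first idea concrete, here is a toy sketch of utilization-aware cache space management: the file-cache budget shrinks as device utilization rises, reflecting the observation that mode switching degrades once too much capacity is consumed. The thresholds and fractions are invented for illustration and are not from the paper.

```python
def cache_budget(capacity_bytes, utilization):
    """Return how many bytes the file cache may occupy (illustrative policy)."""
    if utilization < 0.5:
        # Plenty of fast-mode headroom: cache aggressively.
        return int(0.4 * capacity_bytes)
    if utilization < 0.8:
        # Taper off as internal mode switching becomes costly.
        return int(0.2 * capacity_bytes)
    # Device nearly full: keep only the hottest files cached.
    return int(0.05 * capacity_bytes)

print(cache_budget(512 * 2**30, 0.6))  # budget for a 512 GiB drive at 60% full
```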
{"title":"Dynamic File Cache Optimization for Hybrid SSDs with High-Density and Low-Cost Flash Memory","authors":"Ben Gu, Longfei Luo, Yina Lv, Changlong Li, Liang Shi","doi":"10.1109/ICCD53106.2021.00036","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00036","url":null,"abstract":"Over the last few years, hybrid solid-state drives (SSDs) have been widely adopted due to their high performance and high capacity. Devices equipped with hybrid SSDs can be utilized to cache files from the network for performance improvement. However, this paper finds an interesting observation, that is, the efficiency of hybrid SSDs is significantly degraded instead of improved when too much data is cached. This is because the internal mode switching between different types of flash memory is affected by the device utilization. This paper proposes a dynamic file cache optimization scheme for hybrid SSDs, DFCache, which optimizes the device’s efficiency and limits unreasonable space consumption. DFCache includes two key ideas, dynamic cache space management, and intelligent cache file sifting. DFCache is implemented in Linux kernel and tested under real hybrid SSDs. Experimental results show that the I/O performance outperforms the state-of-the-art by up to 3.7x.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114376608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WidePipe: High-Throughput Deep Learning Inference System on a Cluster of Neural Processing Units
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00091
Lixian Ma, En Shao, Yueyuan Zhou, Guangming Tan
The wide application of machine learning has promoted ML-as-a-Service (MLaaS), a serverless computing paradigm for rapidly deploying a trained model as a service. However, it is challenging to design an inference system that copes with heavy traffic at low latency across heterogeneous neural networks. It is difficult to adaptively configure multilevel parallelism in existing cloud inference systems for machine learning services, particularly when the cluster has accelerators such as GPUs, NPUs, and FPGAs. These issues lead to poor resource utilization and limit system throughput. In this paper, we propose and implement a high-throughput inference system called WidePipe, which leverages reinforcement learning to co-adapt resource allocation and request batch size according to device status. We evaluated the performance of WidePipe on a large cluster with 1000 neural processing units in 250 nodes. Our experimental results show that WidePipe achieves 2.11× higher throughput than current inference systems when deploying heterogeneous machine learning services, while meeting the service-level objectives for response time.
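As a hedged illustration of such a control loop, the sketch below uses a simple epsilon-greedy bandit to pick an (NPU count, batch size) action from observed throughput, with zero reward on SLO violations. WidePipe's actual reinforcement learning formulation (state, action, and reward design) is not reproduced here.

```python
import random

# Hypothetical action space: how many NPUs to allocate and what batch size to use.
ACTIONS = [(npus, batch) for npus in (1, 2, 4) for batch in (8, 16, 32)]
q = {a: 0.0 for a in ACTIONS}        # running value estimate per action
counts = {a: 0 for a in ACTIONS}

def choose(eps=0.1):
    """Explore a random configuration with probability eps, else exploit."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(q, key=q.get)

def update(action, throughput, latency, slo=0.05):
    """Reward throughput only when the response-time SLO is met."""
    reward = throughput if latency <= slo else 0.0
    counts[action] += 1
    q[action] += (reward - q[action]) / counts[action]   # incremental mean
```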
{"title":"WidePipe: High-Throughput Deep Learning Inference System on a Cluster of Neural Processing Units","authors":"Lixian Ma, En Shao, Yueyuan Zhou, Guangming Tan","doi":"10.1109/ICCD53106.2021.00091","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00091","url":null,"abstract":"The wide application of machine learning technology promotes the generation of ML-as-a-Service(MLaaS), which is a serverless computing paradigm for rapidly deploying a trained model as a serving. However, it is a challenge to design an inference system that is capable of coping with large traffic for low latency and heterogeneous neural networks. It is difficult to adaptively configure multilevel parallelism in existing cloud inference systems for machine learning servings, particularly if the cluster has accelerators, such as GPUs, NPUs, FPGAs, etc. These issues lead to poor resource utilization and limit the system throughput. In this paper, we propose and implement a high-throughput inference system called WidePipe, which WidePipe leverages reinforcement learning to co-adapt resource allocation and batch size of request according to device status. We evaluated the performance of WidePipe for a large cluster with 1000 neural processing units in 250 nodes. Our experimental results show that WidePipe has a 2.11× higher throughput than current inference systems when deploying heterogeneous machine learning servings, meeting the service-level objectives for the response time.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133709963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Flipping Bits to Share Crossbars in ReRAM-Based DNN Accelerator
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00016
Lei Zhao, Youtao Zhang, Jun Yang
Future deep neural networks (DNNs) tend to grow deeper and contain more trainable weights. Although methods such as pruning and quantization are widely adopted to reduce a DNN's model size and computation, they are less applicable to ReRAM-based DNN accelerators. On the one hand, because the cells in crossbars are accessed uniformly, it is difficult to exploit fine-grained pruning in ReRAM-based DNN accelerators. On the other hand, aggressive quantization results in poor accuracy, compounded by the low precision with which ReRAM cells represent weight values. In this paper, we propose BFlip, a novel model size and computation reduction technique that shares crossbars among multiple bit matrices. BFlip clusters similar bit matrices together and finds a combination of row and column flips for each bit matrix that minimizes its distance to the centroid of the cluster. Therefore, only the centroid bit matrix is stored in the crossbar and is shared by all other bit matrices in that cluster. We also propose a calibration method to improve accuracy, as well as a ReRAM-based DNN accelerator that fully reaps the storage and computation benefits of BFlip. Our experiments show that BFlip effectively reduces model size and computation with negligible accuracy impact. The proposed accelerator achieves 2.45× speedup and 85% energy reduction over the ISAAC baseline.
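The flip-fitting step can be sketched as a greedy alternating descent (our simplification; the paper's exact algorithm may differ): flip any row, then any column, whose bits disagree with the centroid in more than half of their positions, and repeat.

```python
import numpy as np

def fit_flips(m, centroid, iters=4):
    """Greedily choose row/column flips that bring bit matrix m close to centroid."""
    rows = np.zeros(m.shape[0], dtype=bool)   # per-row flip bits
    cols = np.zeros(m.shape[1], dtype=bool)   # per-column flip bits
    for _ in range(iters):
        diff = m ^ rows[:, None] ^ cols[None, :] ^ centroid
        rows ^= diff.sum(axis=1) > m.shape[1] // 2   # flip rows mostly in disagreement
        diff = m ^ rows[:, None] ^ cols[None, :] ^ centroid
        cols ^= diff.sum(axis=0) > m.shape[0] // 2   # then columns likewise
    residual = int((m ^ rows[:, None] ^ cols[None, :] ^ centroid).sum())
    return rows, cols, residual   # residual = bits still mismatching the centroid

rng = np.random.default_rng(0)
centroid = rng.integers(0, 2, (8, 8)).astype(bool)
m = centroid ^ (rng.random((8, 8)) < 0.1)     # a similar matrix: ~10% noise
print(fit_flips(m, centroid)[2])              # only flips + residual need handling
```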
{"title":"Flipping Bits to Share Crossbars in ReRAM-Based DNN Accelerator","authors":"Lei Zhao, Youtao Zhang, Jun Yang","doi":"10.1109/ICCD53106.2021.00016","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00016","url":null,"abstract":"Future deep neural networks (DNNs) tend to grow deeper and contain more trainable weights. Although methods such as pruning and quantization are widely adopted to reduce DNN’s model size and computation, they are less applicable in the area of ReRAM-based DNN accelerators. On the one hand, because the cells in crossbars are accessed uniformly, it is difficult to explore fine-grained pruning in ReRAM-based DNN accelerators. On the other hand, aggressive quantization results in poor accuracy coupled with the low precision of ReRAM cells to represent weight values.In this paper, we propose BFlip – a novel model size and computation reduction technique – to share crossbars among multiple bit matrices. BFlip clusters similar bit matrices together, and finds a combination of row and column flips for each bit matrix to minimize its distance to the centroid of the cluster. Therefore, only the centroid bit matrix is stored in the crossbar, which is shared by all other bit matrices in that cluster. We also propose a calibration method to improve the accuracy as well as a ReRAM-based DNN accelerator to fully reap the storage and computation benefits of BFlip. Our experiments show that BFlip effectively reduces model size and computation with negligible accuracy impact. The proposed accelerator achieves 2.45 × speedup and 85% energy reduction over the ISAAC baseline.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133950838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
APT: Efficient Side-Channel Analysis Framework against Inner Product Masking Scheme
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00093
Jingdian Ming, Wei Cheng, Yongbin Zhou, Huizhong Li
Due to its provable security and remarkable device independence, masking has been widely accepted as a strong algorithmic-level countermeasure against side-channel attacks. Several code-based masking schemes have subsequently been proposed to strengthen the original Boolean masking (BM) scheme; Inner Product Masking (IPM) is a typical example. In this paper, we provide a framework, named analysis with predicted template (APT), for side-channel analysis of the IPM scheme. Following this framework, we propose two attacks based on maximum likelihood and Euclidean distance, respectively. To evaluate their efficiency, we perform simulated experiments on first-order BM and an optimal IPM scheme. The results show that our proposals are equivalent to a second-order CPA against the BM scheme but significantly more efficient against an optimal IPM. In practical experiments on an ARM Cortex-M4, our proposals initially performed poorly because of a few outliers in the collected leakages; after filtering out these outliers, they performed as efficiently as expected. Finally, we argue that the side-channel security of IPM can be improved by randomly selecting the vector L from an elaborated small set.
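The Euclidean-distance variant reduces to a nearest-template search, sketched below. Building the predicted templates for IPM shares is the paper's contribution and is abstracted here into a given templates array.

```python
import numpy as np

def best_key(traces, templates):
    """Rank key hypotheses by total Euclidean distance to the predicted templates.

    traces:    (n_traces, n_samples) observed leakage
    templates: (n_keys,   n_samples) predicted leakage per key hypothesis
    """
    dists = np.linalg.norm(traces[:, None, :] - templates[None, :, :], axis=2)
    scores = dists.sum(axis=0)          # accumulate distance over all traces
    return int(np.argmin(scores))       # hypothesis closest to the observations
```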
{"title":"APT: Efficient Side-Channel Analysis Framework against Inner Product Masking Scheme","authors":"Jingdian Ming, Wei Cheng, Yongbin Zhou, Huizhong Li","doi":"10.1109/ICCD53106.2021.00093","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00093","url":null,"abstract":"Due to its provable security and remarkable device-independence, masking has been widely accepted as a good algorithmic-level countermeasure against side-channel attacks. Subsequently, several code-based masking schemes are proposed to strengthen the original Boolean masking (BM) scheme, and Inner Product Masking (IPM) scheme is typically one of those. In this paper, we provide a framework, named analysis with predicted template (APT), for side-channel analysis against the IPM scheme. Following this framework, we propose two attacks based on maximum likelihood and Euclidean distance, respectively. To evaluate their efficiency, we perform simulated experiments on first-order BM and an optimal IPM scheme. The results show that our proposals are equivalent to a second-order CPA against BM scheme, but they are significantly efficient against an optimal IPM. In practical experiments based on an ARM Cortex-M4 architecture, the results of our proposals do not turn out well because of a few outliers in collected leakages. After filtering out these outliers, our proposals perform efficiently as expected. Finally, we argue that the side-channel security of IPM can be improved by keeping the vector L to be randomly selected from an elaborated small set.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132389205","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rectification of Integer Arithmetic Circuits using Computer Algebra Techniques
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00039
V. Rao, Haden Ondricek, P. Kalla, Florian Enescu
This paper proposes a symbolic algebra approach for multi-target rectification of integer arithmetic circuits. The circuit is represented as a system of polynomials and rectified against a polynomial specification, with computations modeled over the field of rationals. Given a set of nets as potential rectification targets, we formulate a check to ascertain the existence of rectification functions at these targets. Upon confirmation, we compute the patch functions collectively for the targets. In this regard, we show how to synthesize a logic sub-circuit from polynomial artifacts generated over the field of rationals, and we present new mathematical contributions and results to substantiate this synthesis process. We present two approaches for patch function computation: a greedy approach that resolves the rectification functions for the targets, and an approach that explores a subset of don’t-care conditions for the targets. Our approach is implemented as custom software utilizing existing open-source symbolic algebra libraries. We present experimental results on several integer multiplier benchmarks and discuss the quality of the generated patch sub-circuits.
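The underlying algebraic model, gates as polynomials and verification as reduction modulo the gate ideal, can be sketched with SymPy. This toy example (a single AND gate, not the paper's custom software) checks a specification by reducing it against a Gröbner basis computed over the rationals.

```python
from sympy import symbols, groebner, QQ

z, a, b = symbols('z a b')
gate = z - a * b                      # an AND gate modeled as a polynomial
fields = [a * (a - 1), b * (b - 1)]   # inputs constrained to {0, 1}
spec = z - a * b                      # the specification to verify against

# Variable order z > a > b lets the output reduce toward the inputs.
G = groebner([gate] + fields, z, a, b, order='lex', domain=QQ)
_, remainder = G.reduce(spec)
print(remainder)                      # 0 -> the circuit implements the spec
```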
{"title":"Rectification of Integer Arithmetic Circuits using Computer Algebra Techniques","authors":"V. Rao, Haden Ondricek, P. Kalla, Florian Enescu","doi":"10.1109/ICCD53106.2021.00039","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00039","url":null,"abstract":"This paper proposes a symbolic algebra approach for multi-target rectification of integer arithmetic circuits. The circuit is represented as a system of polynomials and rectified against a polynomial specification with computations modeled over the field of rationals. Given a set of nets as potential rectification targets, we formulate a check to ascertain the existence of rectification functions at these targets. Upon confirmation, we compute the patch functions collectively for the targets. In this regard, we show how to synthesize a logic sub-circuit from polynomial artifacts generated over the field of rationals. We present new mathematical contributions and results to substantiate this synthesis process. We present two approaches for patch function computation: a greedy approach that resolves the rectification functions for the targets and an approach that explores a subset of don’t care conditions for the targets. Our approach is implemented as custom software and utilizes the existing open-source symbolic algebra libraries for computations. We present experimental results of our approach on several integer multipliers benchmark and discuss the quality of the patch sub-circuits generated.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116109991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Premier: A Concurrency-Aware Pseudo-Partitioning Framework for Shared Last-Level Cache
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00068
Xiaoyang Lu, Rujia Wang, Xian-He Sun
As the number of on-chip cores and application demands increase, efficient management of shared cache resources becomes imperative. Cache partitioning techniques have been studied for decades to reduce interference between applications in a shared cache and to provide performance and fairness guarantees. However, there are few studies on how concurrent memory accesses affect the effectiveness of partitioning. When concurrent memory requests exist, cache misses do not reflect the overlap of concurrency well. In this work, we first introduce pure misses per kilo instructions (PMPKI), a metric that quantifies cache efficiency in the presence of concurrent access activity. We then propose Premier, a dynamically adaptive concurrency-aware cache pseudo-partitioning framework. Premier provides insertion and promotion policies based on PMPKI curves to achieve the benefits of cache partitioning. Finally, our evaluation across various workloads shows that Premier outperforms state-of-the-art cache partitioning schemes in both performance and fairness. In an 8-core system, Premier achieves 15.45% higher system performance and 10.91% better fairness than the UCP scheme.
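Under a simplified reading of the metric (our assumption; the paper's cycle-accurate definition is more refined), a miss is "pure" only when its latency is not hidden by another in-flight miss. A toy computation follows.

```python
def pmpki(miss_events, instructions):
    """Pure misses per kilo instructions under a simplified overlap model.

    miss_events: time-ordered list of (start_cycle, end_cycle) per cache miss
    """
    pure = 0
    busy_until = -1
    for start, end in miss_events:
        if start >= busy_until:        # latency not overlapped by an earlier miss
            pure += 1
        busy_until = max(busy_until, end)
    return 1000.0 * pure / instructions

misses = [(0, 100), (10, 110), (300, 400)]   # the second miss overlaps the first
print(pmpki(misses, 5000))                   # 2 pure misses -> 0.4 PMPKI
```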
{"title":"Premier: A Concurrency-Aware Pseudo-Partitioning Framework for Shared Last-Level Cache","authors":"Xiaoyang Lu, Rujia Wang, Xian-He Sun","doi":"10.1109/ICCD53106.2021.00068","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00068","url":null,"abstract":"As the number of on-chip cores and application demands increase, efficient management of shared cache resources becomes imperative. Cache partitioning techniques have been studied for decades to reduce interference between applications in a shared cache and provide performance and fairness guarantees. However, there are few studies on how concurrent memory accesses affect the effectiveness of partitioning. When concurrent memory requests exist, cache miss does not reflect concurrency overlapping well. In this work, we first introduce pure misses per kilo instructions (PMPKI), a metric that quantifies the cache efficiency considering concurrent access activities. Then we propose Premier, a dynamically adaptive concurrency-aware cache pseudo-partitioning framework. Premier provides insertion and promotion policies based on PMPKI curves to achieve the benefits of cache partitioning. Finally, our evaluation of various workloads shows that Premier outperforms state-of-the-art cache partitioning schemes in terms of performance and fairness. In an 8-core system, Premier achieves 15.45% higher system performance and 10.91% better fairness than the UCP scheme.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121966144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}