With the development of NAND flash technology, hybrid SSDs built on high-density, low-cost flash memory have become the mainstream SSD architecture. In this architecture, the flash can be dynamically switched between two modes, such as single-level cell (SLC) mode and quad-level cell (QLC) mode. Based on evaluations and analysis of multiple real devices, this paper presents two interesting findings, which demonstrate that the coordination between the two flash modes is not well designed in existing architectures. This paper proposes HyFlex, which redesigns the data placement and flash-mode management strategies of hybrid SSDs in a flexible way. Specifically, two novel optimization strategies are proposed: velocity-based I/O scheduling (VIS) and garbage collection (GC)-aware capacity tuning (GCT). Experimental results show that HyFlex achieves encouraging performance and endurance improvements.
Title: "Understanding and Optimizing Hybrid SSD with High-Density and Low-Cost Flash Memory". Authors: Liang Shi, Longfei Luo, Yina Lv, Shicheng Li, Changlong Li, E. Sha. DOI: 10.1109/ICCD53106.2021.00046. Pub Date: 2021-10-01. Venue: 2021 IEEE 39th International Conference on Computer Design (ICCD).
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00059
Hossein Golestani, T. Wenisch
Datacenter systems rely on fast, efficient I/O software stacks—Software Data Planes (SDPs)—to coordinate frequent interaction among myriad processes (or VMs) and I/O devices (NICs, SSDs, etc.). Given the impressive and ever-growing speed of today’s I/O devices and the μs-scale computations driven by hyper-tenancy and microservice-based applications, SDPs play a crucial role in overall system performance and efficiency. In this work, we aim to enhance data transfer among the SDP, I/O devices, and applications/VMs by designing the HyperData accelerator. Data items in SDP systems, such as network packets or storage blocks, are transferred through shared memory queues. Consumer cores typically access the data from DRAM or, thanks to technologies like Intel DDIO, from the (shared) last-level cache. Today, consumers cannot effectively prefetch such data to nearer caches due to the lack of a proper arrival notification mechanism and the complex access pattern of data buffers. HyperData is designed to perform targeted prefetching, wherein the exact data items (or a required subset) are prefetched to the L1 cache of the consumer core. Furthermore, HyperData is applicable to both core–device and core–core data communication, and it supports complex queue formats like Virtio and multi-consumer queues. HyperData is realized with a per-core programmable prefetcher, which issues the prefetch requests, and a system-level monitoring set, which monitors queues for data arrival and triggers prefetch operations. We show that HyperData improves processing latency by 1.20–2.42× in a simulation of a state-of-the-art SDP, with only a few hundred bytes of per-core overhead.
Title: "HyperData: A Data Transfer Accelerator for Software Data Planes Based on Targeted Prefetching". Venue: 2021 IEEE 39th International Conference on Computer Design (ICCD).
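The shared-memory queues described in the abstract can be pictured with a minimal single-producer/single-consumer ring (a toy sketch with hypothetical names; no real concurrency or prefetching is modeled). The only "arrival notification" a consumer gets is a published tail index it must poll, which is why the payload is still in DRAM or the LLC, not the consumer's L1, by the time it is read:

```python
class SPSCQueue:
    """Toy single-producer/single-consumer ring (no real concurrency).
    Mirrors the shared-memory descriptor queues SDPs use: the producer
    publishes items by advancing 'tail', and the consumer learns of
    arrivals only by polling 'tail' against its own 'head'."""
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0  # next slot the consumer will read
        self.tail = 0  # next slot the producer will write

    def push(self, item):
        assert self.tail - self.head < len(self.buf), "queue full"
        self.buf[self.tail % len(self.buf)] = item
        self.tail += 1  # publishing tail is the only arrival "signal"

    def pop(self):
        # By the time polling observes head != tail, the payload sits in
        # DRAM or the shared LLC, not this core's L1 -- the gap that a
        # targeted prefetcher like HyperData aims to close.
        if self.head == self.tail:
            return None
        item = self.buf[self.head % len(self.buf)]
        self.head += 1
        return item
```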
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00069
Pete Ehrett, Todd M. Austin, V. Bertacco
As computational demands rise, the need for specialized hardware has grown acute. However, the immense cost of fully-custom chips has forced many developers to rely on suboptimal solutions like FPGAs, especially for low- to mid-volume applications, in which multi-million-dollar non-recurring engineering (NRE) costs cannot be amortized effectively. We propose to address this problem by composing custom chips out of small, algorithmic chiplets, reusable across diverse designs, such that high NRE costs may be amortized across many different designs. This work models the economics of this paradigm and identifies a cost-optimal granularity for algorithmic chiplets, then demonstrates how those guidelines may be applied to design high-performance, algorithmically-composable hardware components – which may be reused, without modification, across many different processing pipelines. For an example phased-array radar accelerator, our chiplet-centric paradigm improves perf-per-$ by 9.3× over an FPGA, and ∼4× over a conventional ASIC.
Title: "Chopin: Composing Cost-Effective Custom Chips with Algorithmic Chiplets". Venue: 2021 IEEE 39th International Conference on Computer Design (ICCD).
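The NRE-amortization argument behind Chopin can be made concrete with back-of-the-envelope arithmetic (all dollar figures below are invented for illustration, not taken from the paper):

```python
def cost_per_chip(nre, unit_cost, volume):
    """Per-chip cost: non-recurring engineering (NRE) amortized over
    the production volume, plus the recurring unit cost."""
    return nre / volume + unit_cost

# Monolithic ASIC: one low-volume design bears the full NRE.
asic = cost_per_chip(nre=20e6, unit_cost=50, volume=10_000)      # $2050

# Chiplet reuse: the same chiplet NRE is shared by 10 designs, so each
# design amortizes a tenth of it (a higher unit cost models packaging
# and inter-chiplet overheads).
chiplet = cost_per_chip(nre=20e6 / 10, unit_cost=60, volume=10_000)  # $260
```

At low volumes the amortized NRE dominates, which is exactly why sharing it across many designs moves the perf-per-$ needle.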
With the rise of Next-Generation Sequencing (NGS) technology, clinical sequencing services have become more accessible but also face new challenges. The surging demand motivates the development of more efficient algorithms for computational genomics and their hardware acceleration. In this work, we use GPUs to accelerate DNA variant calling and its related alignment problem. The Pair-Hidden Markov Model (Pair-HMM) is one of the most popular and compute-intensive models used in variant calling. As a critical part of the Pair-HMM, the forward algorithm is not only compute-intensive but also data-intensive. Multiple previous works have sought to accelerate the forward algorithm through massive parallelization of the workload. In this paper, we bring advanced GPU implementations with various optimizations, such as efficient host-device communication, task parallelization, pipelining, and memory management, to tackle this challenging task. Our design achieves a speedup of 783× compared to the Java baseline on a single-core Intel CPU, 31.88× over the C++ baseline on an IBM Power8 multicore CPU, and 1.53×–2.21× over previous state-of-the-art GPU implementations across various genomics datasets.
Title: "Improved GPU Implementations of the Pair-HMM Forward Algorithm for DNA Sequence Alignment". Authors: Enliang Li, Subho Sankar Banerjee, Sitao Huang, R. Iyer, Deming Chen. DOI: 10.1109/ICCD53106.2021.00055. Pub Date: 2021-10-01. Venue: 2021 IEEE 39th International Conference on Computer Design (ICCD).
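For readers unfamiliar with the kernel being accelerated, here is a minimal CPU-side sketch of the Pair-HMM forward recurrence over match (M), insert (I), and delete (D) states; the transition and emission parameters are illustrative defaults, not those of any production variant caller:

```python
import numpy as np

def pair_hmm_forward(x, y, delta=0.1, eps=0.3, p_match=0.9):
    """Forward probability of aligning sequences x and y under a
    3-state Pair-HMM (Match / Insert / Delete).
    delta: gap-open, eps: gap-extend; all parameters illustrative."""
    n, m = len(x), len(y)
    M = np.zeros((n + 1, m + 1))
    I = np.zeros((n + 1, m + 1))
    D = np.zeros((n + 1, m + 1))
    M[0, 0] = 1.0  # start in the Match state with probability 1
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                # Emission: matching base pair vs. one of 3 mismatches.
                emit = p_match if x[i - 1] == y[j - 1] else (1 - p_match) / 3
                M[i, j] = emit * ((1 - 2 * delta) * M[i - 1, j - 1]
                                  + (1 - eps) * (I[i - 1, j - 1] + D[i - 1, j - 1]))
            if i > 0:
                # Insert emits x[i-1] against a gap, uniform over 4 bases.
                I[i, j] = 0.25 * (delta * M[i - 1, j] + eps * I[i - 1, j])
            if j > 0:
                D[i, j] = 0.25 * (delta * M[i, j - 1] + eps * D[i, j - 1])
    return M[n, m] + I[n, m] + D[n, m]
```

Each cell depends on its north, west, and north-west neighbors, which is why GPU implementations parallelize along anti-diagonals and why the kernel is as data-intensive as it is compute-intensive.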
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00098
Khitam M. Alatoun, Shanmukha Murali Achyutha, R. Vemuri
Information flow properties are essential to identify security vulnerabilities in System-on-Chip (SoC) designs. Verifying information flow properties, such as integrity and confidentiality, is challenging as these properties cannot be handled using traditional assertion-based verification techniques. This paper proposes two novel approaches, a universal method and a property-driven method, to verify and monitor information flow properties. Both methods can be used for formal verification, dynamic verification during simulation, post-fabrication validation, and run-time monitoring. The universal method expedites implementing the information flow model and has less complexity than the most recently published technique. The property-driven method reduces the overhead of the security model, which helps speed up the verification process and create an efficient run-time hardware monitor. More than 20 information flow properties from 5 different designs were verified and several bugs were identified. We show that the method is scalable for large systems by applying it to an SoC design based on an OpenRISC-1200 processor.
Title: "Efficient Methods for SoC Trust Validation Using Information Flow Verification". Venue: 2021 IEEE 39th International Conference on Computer Design (ICCD).
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00013
Isaías B. Felzmann, João Fabrício Filho, Juliane Regina de Oliveira, L. Wanner
Approximate systems are designed to offer improved efficiency with potentially reduced quality of results. Quality of output in these systems is typically quantified in comparison to a precise result using metrics such as RMSE, MAE, PSNR, or application-specific metrics such as structural similarity of images (SSIM). Furthermore, systems are typically designed to maximize efficiency for a given minimum quality requirement. It is often difficult to determine what this quality requirement should be for an application, let alone a system. Thus, a fixed quality requirement may be overly conservative, and leave optimization opportunities on the table. In this work, we present a different approach to evaluate approximate systems based on the usefulness of results instead of quality. Our method qualitatively determines the acceptability of approximate results within different processing pipelines. To demonstrate the method, we implement three image and signal processing applications featuring scenarios of image classification, image recognition, and frequency estimation. Our results show that designing approximate systems to guarantee acceptability can produce up to 20% more valid results than the conservative quality thresholds commonly adopted in the literature, allowing for higher error rates and, consequently, lower energy cost.
Title: "Special Session: How much quality is enough quality? A case for acceptability in approximate designs". Venue: 2021 IEEE 39th International Conference on Computer Design (ICCD).
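The acceptability idea can be illustrated with a toy pipeline (the classifier and signals below are hypothetical, not from the paper): an approximate result with nonzero numeric error is still useful whenever the downstream consumer reaches the same decision, so a fixed RMSE threshold can reject perfectly valid outputs:

```python
import numpy as np

def rmse(a, b):
    # Conventional quality metric: root-mean-square error vs. precise result.
    return float(np.sqrt(np.mean((a - b) ** 2)))

def acceptable(precise, approx, consumer):
    """Usefulness-based check: the approximate output is acceptable iff
    the downstream consumer reaches the same decision it would have
    reached on the precise output."""
    return consumer(approx) == consumer(precise)

# Toy consumer: classify a signal by the sign of its mean.
classify = lambda sig: "positive" if sig.mean() > 0 else "negative"

precise = np.array([1.0, 2.0, 3.0])
approx = np.array([0.5, 2.5, 3.5])  # nonzero numeric error ...
# ... yet the classification decision is unchanged, so a fixed RMSE
# threshold could have rejected a perfectly useful result.
```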
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00095
Z. Wang, S. Mohammed, Y. Makris, Benjamin Carrión Schäfer
VLSI design companies are now mainly fabless and spend large amounts of resources to develop their Intellectual Property (IP). It is therefore paramount to protect their IPs from being stolen and illegally reverse engineered. The main approach so far has been to add locking logic such that the circuit does not meet the given specifications unless the user applies the correct key. The main problem with this approach is that the fabless company has to submit the entire design, including the locking circuitry, to the fab. Moreover, these companies often subcontract the VLSI design back-end to a third party, which implies that the third-party company or the fab could potentially tamper with the locking mechanism. One alternative approach is to lock through omission. The main idea is to judiciously select a portion of the design and map it onto an embedded FPGA (eFPGA). In this case, the bitstream acts as the logic key, and neither the third-party company nor the fab has access to the locking mechanism, as the eFPGA is left un-programmed. This is obviously a more secure way to lock the circuit. The main problem with this approach is the area, power, and delay overhead associated with it. To address this, we present a framework that takes as input an untimed behavioral description for High-Level Synthesis (HLS) and automatically extracts a portion of the circuit to the eFPGA such that the area overhead is minimized while the original timing constraint is not violated. The main advantage of starting at the behavioral level is that partitioning the design at this stage allows the HLS process to fully re-optimize the circuit, thus reducing the overhead introduced by this obfuscation mechanism. We also developed a framework to test our proposed approach and plan to release it to encourage the community to find new techniques to break the proposed obfuscation method.
Title: "Functional Locking through Omission: From HLS to Obfuscated Design". Venue: 2021 IEEE 39th International Conference on Computer Design (ICCD).
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00045
H. Gong, Zhirong Shen, J. Shu
3D flash memory removes the scaling limitations of planar flash memory, yet it is still plagued by the tedious garbage collection (GC) process due to the “big block problem”. In this paper, we propose SpeedupGC, a framework that incorporates the characteristics of data updates into existing sub-block erase designs. The main idea of SpeedupGC is to steer hotly-updated data to blocks that are about to be erased, so as to speculatively produce more invalid pages and suppress the relocation overhead. Extensive trace-driven experiments show that SpeedupGC reduces GC latency by 64.7%, read latency by 21.8%, write latency by 17.7%, and write amplification by 11.5% on average compared to state-of-the-art designs.
Title: "Accelerating Sub-Block Erase in 3D NAND Flash Memory". Venue: 2021 IEEE 39th International Conference on Computer Design (ICCD).
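A minimal sketch of the placement idea, with invented block metadata (the paper's actual policy is more involved): hot updates are steered into blocks that are about to be sub-block erased, so that by erase time those pages have likely been re-updated elsewhere and invalidated, shrinking the relocation work:

```python
def pick_write_block(blocks, is_hot):
    """Steer a page write to a flash block.
    blocks: list of dicts with 'valid' (count of live pages) and
    'erase_soon' (scheduled for imminent sub-block erase).
    Hot data lands in soon-to-be-erased blocks; cold data does not."""
    if is_hot:
        doomed = [b for b in blocks if b["erase_soon"]]
        if doomed:
            # Fewest live pages => least relocation when the erase fires.
            return min(doomed, key=lambda b: b["valid"])
    # Cold data (or no doomed block available): emptiest ordinary block.
    return min((b for b in blocks if not b["erase_soon"]),
               key=lambda b: b["valid"])
```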
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00073
Qianqian Pei, Seunghee Shin
Emerging non-volatile memory technology enables non-volatile main memory (NVMM) that can provide larger capacity and better energy-saving opportunities than DRAM. However, its non-volatility raises security concerns: the data in an NVMM can be taken if the memory is stolen. Therefore, the data must stay encrypted outside the processor boundary. Such encryption requires decryption before the data can be used by the processor, adding extra latency to performance-critical read operations. Split counter mode encryption hides this latency but introduces frequent page re-encryptions as a trade-off. We find that this re-encryption overhead worsens on NVMM, whose slow latency negates prior optimizations. To mitigate the overhead, we re-design the encryption scheme based on two key observations. First, NVMMs only need counters that can count up to twice their lifetime. Second, we observe diminishing returns on the counter size, as increasing it further does not necessarily decrease the re-encryption frequency. Our new designs re-arrange those inefficiently used bits to reduce the re-encryption overhead. In our tests, our two designs, 3-level split counter mode encryption and 8-block split counter mode encryption, reduce the re-encryption overhead by 63% and 66%, improving performance over the original split counter scheme by up to 26% and 30%, and by 8% and 9% on average.
Title: "Improving the Heavy Re-encryption Overhead of Split Counter Mode Encryption for NVM". Venue: 2021 IEEE 39th International Conference on Computer Design (ICCD).
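A toy model of why minor-counter overflows matter (illustrative arithmetic only, not the paper's exact counter organization): in split counter mode a page shares one large "major" counter while each cache line keeps a small "minor" counter, and a minor-counter overflow forces the whole page to be re-encrypted:

```python
def page_reencryptions(writes_to_line, minor_bits):
    """Count whole-page re-encryptions triggered by one hot cache line.
    A b-bit minor counter wraps every 2**b writes to that line; each
    wrap bumps the page's shared major counter and re-encrypts the page."""
    overflow_period = 2 ** minor_bits      # writes until the minor wraps
    return writes_to_line // overflow_period

# One million writes to the same line:
r7 = page_reencryptions(1_000_000, minor_bits=7)    # 7812 re-encryptions
r14 = page_reencryptions(1_000_000, minor_bits=14)  # 61 re-encryptions
```

Small minor counters overflow constantly on a hot line, which is the overhead the paper attacks by re-arranging the inefficiently used counter bits.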
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00070
Toru Koizumi, Shu Sugita, Ryota Shioya, J. Kadomoto, H. Irie, S. Sakai
The renaming unit of a superscalar processor is a very expensive module. It consumes large amounts of power and limits the front-end bandwidth. To overcome this problem, an instruction set architecture called STRAIGHT has been proposed. Owing to its unique manner of referencing operands, STRAIGHT does not cause false dependencies and allows out-of-order execution without register renaming. However, the compiler optimization techniques for STRAIGHT are still immature, and we found that the naive code generators currently available can generate inefficient code with additional instructions. In this paper, we propose two novel compiler optimization techniques and a novel calling convention for STRAIGHT to reduce the number of instructions. We compiled real-world programs with a compiler that implemented these techniques and measured their performance through simulation. The evaluation results show that the proposed methods reduced the number of executed instructions by 15% and improved the performance by 17%.
Title: "Compiling and Optimizing Real-world Programs for STRAIGHT ISA". Venue: 2021 IEEE 39th International Conference on Computer Design (ICCD).