Special Session: How much quality is enough quality? A case for acceptability in approximate designs
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00013
Isaías B. Felzmann, João Fabrício Filho, Juliane Regina de Oliveira, L. Wanner
Approximate systems are designed to offer improved efficiency with potentially reduced quality of results. Quality of output in these systems is typically quantified in comparison to a precise result using metrics such as RMSE, MAE, PSNR, or application-specific metrics such as structural similarity of images (SSIM). Furthermore, systems are typically designed to maximize efficiency for a given minimum quality requirement. It is often difficult to determine what this quality requirement should be for an application, let alone a system. Thus, a fixed quality requirement may be overly conservative, and leave optimization opportunities on the table. In this work, we present a different approach to evaluate approximate systems based on the usefulness of results instead of quality. Our method qualitatively determines the acceptability of approximate results within different processing pipelines. To demonstrate the method, we implement three image and signal processing applications featuring scenarios of image classification, image recognition, and frequency estimation. Our results show that designing approximate systems to guarantee acceptability can produce up to 20% more valid results than the conservative quality thresholds commonly adopted in the literature, allowing for higher error rates and, consequently, lower energy cost.
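The contrast the abstract draws between a fixed quality threshold and task-level acceptability can be made concrete with a small sketch. This is an illustration, not the authors' implementation: the function names are hypothetical, and the classifier is a stand-in for whatever pipeline stage consumes the approximate output.

```c
#include <math.h>
#include <stddef.h>

/* A fixed quality threshold accepts an approximate output when its RMSE
 * versus the precise output is below a bound; task-level acceptability
 * instead asks whether the downstream consumer still produces a usable
 * result. */

double rmse(const double *precise, const double *approx, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = precise[i] - approx[i];
        acc += d * d;
    }
    return sqrt(acc / n);
}

/* Quality-based acceptance: conservative and application-agnostic. */
int accept_by_quality(const double *precise, const double *approx,
                      size_t n, double rmse_bound) {
    return rmse(precise, approx, n) <= rmse_bound;
}

/* Acceptability-based acceptance: the result is valid as long as the
 * pipeline's final decision is unchanged, even if the raw error is
 * above the conventional threshold. */
int accept_by_usefulness(int (*classify)(const double *, size_t),
                         const double *precise, const double *approx,
                         size_t n) {
    return classify(approx, n) == classify(precise, n);
}
```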
{"title":"Special Session: How much quality is enough quality? A case for acceptability in approximate designs","authors":"Isaías B. Felzmann, João Fabrício Filho, Juliane Regina de Oliveira, L. Wanner","doi":"10.1109/ICCD53106.2021.00013","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00013","url":null,"abstract":"Approximate systems are designed to offer improved efficiency with potentially reduced quality of results. Quality of output in these systems is typically quantified in comparison to a precise result using metrics such as RMSE, MAE, PSNR, or application-specific metrics such as structural similarity of images (SSIM). Furthermore, systems are typically designed to maximize efficiency for a given minimum quality requirement. It is often difficult to determine what this quality requirement should be for an application, let alone a system. Thus, a fixed quality requirement may be overly conservative, and leave optimization opportunities on the table. In this work, we present a different approach to evaluate approximate systems based on the usefulness of results instead of quality. Our method qualitatively determines the acceptability of approximate results within different processing pipelines. To demonstrate the method, we implement three image and signal processing applications featuring scenarios of image classification, image recognition, and frequency estimation. Our results show that designing approximate systems to guarantee acceptability can produce up to 20% more valid results than the conservative quality thresholds commonly adopted in the literature, allowing for higher error rates and, consequently, lower energy cost.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129586708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Methods for SoC Trust Validation Using Information Flow Verification
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00098
Khitam M. Alatoun, Shanmukha Murali Achyutha, R. Vemuri
Information flow properties are essential to identify security vulnerabilities in System-on-Chip (SoC) designs. Verifying information flow properties, such as integrity and confidentiality, is challenging as these properties cannot be handled using traditional assertion-based verification techniques. This paper proposes two novel approaches, a universal method and a property-driven method, to verify and monitor information flow properties. Both methods can be used for formal verification, dynamic verification during simulation, post-fabrication validation, and run-time monitoring. The universal method expedites implementing the information flow model and has less complexity than the most recently published technique. The property-driven method reduces the overhead of the security model, which helps speed up the verification process and create an efficient run-time hardware monitor. More than 20 information flow properties from 5 different designs were verified and several bugs were identified. We show that the method is scalable for large systems by applying it to an SoC design based on an OpenRISC-1200 processor.
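To give a flavor of what tracking an information flow property at the hardware level involves, here is a GLIFT-style shadow-logic illustration. This is a generic textbook construction under assumed types, not the paper's universal or property-driven method:

```c
#include <stdint.h>

/* Each signal carries a taint bit alongside its value; a
 * confidentiality property fails if taint originating at a secret
 * source ever reaches an observable output. */

typedef struct { uint8_t val; uint8_t taint; } sig_t;

/* AND gate with precise taint propagation: a tainted input can only
 * influence the output when the other input is 1 (or itself tainted). */
sig_t and_gate(sig_t a, sig_t b) {
    sig_t out;
    out.val   = a.val & b.val;
    out.taint = (a.taint & b.taint)
              | (a.taint & b.val)
              | (b.taint & a.val);
    return out;
}

/* A run-time monitor would assert that no tainted value reaches an
 * untrusted output port. */
int confidentiality_violated(sig_t observable_output) {
    return observable_output.taint != 0;
}
```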
{"title":"Efficient Methods for SoC Trust Validation Using Information Flow Verification","authors":"Khitam M. Alatoun, Shanmukha Murali Achyutha, R. Vemuri","doi":"10.1109/ICCD53106.2021.00098","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00098","url":null,"abstract":"Information flow properties are essential to identify security vulnerabilities in System-on-Chip (SoC) designs. Verifying information flow properties, such as integrity and confidentiality, is challenging as these properties cannot be handled using traditional assertion-based verification techniques. This paper proposes two novel approaches, a universal method and a property-driven method, to verify and monitor information flow properties. Both methods can be used for formal verification, dynamic verification during simulation, post-fabrication validation, and run-time monitoring. The universal method expedites implementing the information flow model and has less complexity than the most recently published technique. The property-driven method reduces the overhead of the security model, which helps speed up the verification process and create an efficient run-time hardware monitor. More than 20 information flow properties from 5 different designs were verified and several bugs were identified. We show that the method is scalable for large systems by applying it to an SoC design based on an OpenRISC-1200 processor.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130280750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Differential Testing of x86 Instruction Decoders with Instruction Operand Inferring Algorithm
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00040
Guang Wang, Ziyuan Zhu, Shuan Li, Xu Cheng, Dan Meng
Instruction decoders are tools for software analysis, sandboxing, malware detection, and the detection of undocumented instructions. The decoders must be accurate and consistent with the instruction set architecture manuals. Existing testing methods for instruction decoders are based on random and instruction-structure mutation, and they are mainly aimed at the legal instruction space. However, there is little research on whether instructions in the reserved instruction space can be accurately identified as invalid. We propose an instruction operand inferring algorithm, based on depth-first search, to skip considerable redundant legal instruction space. The algorithm keeps the types of instructions in the legal instruction space unchanged and guarantees the traversal of the reserved instruction space. In addition, we propose a differential testing method that discovers decoding discrepancies between instruction decoders. We applied the method to XED and Capstone and found four million inconsistent instructions between them. Compared with the existing instruction generation method based on depth-first search, the efficiency of our method is improved by about four times.
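A differential-testing skeleton for this setting might look like the sketch below. The Capstone side uses the library's real C API; `reference_decode` is a hypothetical stub standing in for the second decoder (an XED-based routine in the paper's setup), not code from the paper.

```c
#include <capstone/capstone.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the second decoder; returns the decoded length in
 * bytes, or 0 if the byte sequence is invalid. Plug in the real
 * reference decoder here. */
static size_t reference_decode(const uint8_t *code, size_t size) {
    (void)code; (void)size;
    return 0;
}

/* Decode the same bytes with both decoders and report when they
 * disagree on validity or on instruction length. */
void diff_test(const uint8_t *code, size_t size) {
    csh handle;
    if (cs_open(CS_ARCH_X86, CS_MODE_64, &handle) != CS_ERR_OK)
        return;

    cs_insn *insn;
    size_t cs_count = cs_disasm(handle, code, size, 0x1000, 1, &insn);
    size_t cs_len = cs_count ? insn[0].size : 0;
    size_t ref_len = reference_decode(code, size);

    if ((cs_len == 0) != (ref_len == 0) || cs_len != ref_len)
        printf("discrepancy at %02x %02x ...: capstone=%zu ref=%zu\n",
               code[0], size > 1 ? code[1] : 0, cs_len, ref_len);

    if (cs_count)
        cs_free(insn, cs_count);
    cs_close(&handle);
}
```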
{"title":"Differential Testing of x86 Instruction Decoders with Instruction Operand Inferring Algorithm","authors":"Guang Wang, Ziyuan Zhu, Shuan Li, Xu Cheng, Dan Meng","doi":"10.1109/ICCD53106.2021.00040","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00040","url":null,"abstract":"The instruction decoders are tools for software analysis, sandboxing, malware detection, and undocumented instructions detection. The decoders must be accurate and consistent with the instruction set architecture manuals. The existing testing methods for instruction decoders are based on random and instruction structure mutation. Moreover, the methods are mainly aimed at the legal instruction space. However, there is little research on whether the instructions in the reserved instruction space can be accurately identified as invalid instructions. We propose an instruction operand inferring algorithm, based on the depth-first search algorithm, to skip considerable redundant legal instruction space. The algorithm keeps the types of instructions in the legal instruction space unchanged and guarantees the traversal of the reserved instruction space. In addition, we propose a differential testing method that discovers decoding discrepancies between instruction decoders. We applied the method to XED and Capstone and found four million inconsistent instructions between them. Compared with the existing instruction generation method based on the depth-first search algorithm, the efficiency of our method is improved by about four times.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128724085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chopin: Composing Cost-Effective Custom Chips with Algorithmic Chiplets
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00069
Pete Ehrett, Todd M. Austin, V. Bertacco
As computational demands rise, the need for specialized hardware has grown acute. However, the immense cost of fully-custom chips has forced many developers to rely on suboptimal solutions like FPGAs, especially for low- to mid-volume applications, in which multi-million-dollar non-recurring engineering (NRE) costs cannot be amortized effectively. We propose to address this problem by composing custom chips out of small, algorithmic chiplets, reusable across diverse designs, such that high NRE costs may be amortized across many different designs. This work models the economics of this paradigm and identifies a cost-optimal granularity for algorithmic chiplets, then demonstrates how those guidelines may be applied to design high-performance, algorithmically-composable hardware components – which may be reused, without modification, across many different processing pipelines. For an example phased-array radar accelerator, our chiplet-centric paradigm improves perf-per-$ by 9.3× over an FPGA, and ∼4× over a conventional ASIC.
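The amortization argument reduces to simple arithmetic, sketched below with assumed numbers (they are illustrative, not the paper's data): a reusable chiplet spreads its NRE over every design that instantiates it, while a monolithic ASIC pays its NRE over a single product's volume.

```c
#include <stdio.h>

/* Per-unit cost = amortized NRE + recurring unit cost. */
double unit_cost(double nre, double per_unit, long units) {
    return nre / units + per_unit;
}

int main(void) {
    /* Monolithic ASIC: $20M NRE amortized over one 50k-unit product. */
    double asic = unit_cost(20e6, 50.0, 50000);        /* = $450/unit */

    /* Chiplet: $4M NRE, slightly higher recurring cost, but reused
     * across 10 different designs of 50k units each. */
    double chiplet = unit_cost(4e6, 60.0, 10 * 50000); /* = $68/unit  */

    printf("ASIC $%.0f/unit vs chiplet $%.0f/unit\n", asic, chiplet);
    return 0;
}
```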
{"title":"Chopin: Composing Cost-Effective Custom Chips with Algorithmic Chiplets","authors":"Pete Ehrett, Todd M. Austin, V. Bertacco","doi":"10.1109/ICCD53106.2021.00069","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00069","url":null,"abstract":"As computational demands rise, the need for specialized hardware has grown acute. However, the immense cost of fully-custom chips has forced many developers to rely on suboptimal solutions like FPGAs, especially for low- to mid-volume applications, in which multi-million-dollar non-recurring engineering (NRE) costs cannot be amortized effectively. We propose to address this problem by composing custom chips out of small, algorithmic chiplets, reusable across diverse designs, such that high NRE costs may be amortized across many different designs. This work models the economics of this paradigm and identifies a cost-optimal granularity for algorithmic chiplets, then demonstrates how those guidelines may be applied to design high-performance, algorithmically-composable hardware components – which may be reused, without modification, across many different processing pipelines. For an example phased-array radar accelerator, our chiplet-centric paradigm improves perf-per-$ by 9.3× over an FPGA, and ∼4× over a conventional ASIC.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123514086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Write-Optimized and Consistent RDMA-based Non-Volatile Main Memory Systems
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00048
Xinxin Liu, Yu Hua, Xuan Li, Qifan Liu
To deliver high performance in cloud computing, many efforts leverage RDMA (Remote Direct Memory Access) in networking and NVMM (Non-Volatile Main Memory) in end systems. Because it involves no remote CPU, one-sided RDMA offers efficient access to remote memory, and NVMM technologies have the strengths of non-volatility, byte-addressability, and DRAM-like latency. However, to guarantee Remote Data Atomicity (RDA), the combined scheme has to pay extra network round-trips, remote CPU participation, and double NVMM writes. To address these problems, we propose Erda, a write-optimized, log-structured NVMM design for Efficient Remote Data Atomicity. In Erda, clients directly transfer data to the destination memory addresses in the logs on servers via one-sided RDMA writes, without redundant copies or remote CPU consumption. To detect whether fetched data is atomic, we verify a checksum without client-server coordination. We further ensure metadata consistency by leveraging an 8-byte atomic update in a hash table, which also stores the addresses of previous versions of data in the log. When a failure occurs, the server efficiently restores a consistent state. Experimental results show that, compared with state-of-the-art schemes, Erda reduces NVMM writes by approximately 50%, significantly improves throughput, and decreases latency.
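The two consistency mechanisms described above can be sketched as follows. The layout and hash function are assumptions for illustration, not Erda's actual on-NVM format: a checksummed log entry lets a reader detect a torn or incomplete RDMA write without involving the server CPU, and an aligned 8-byte atomic store publishes the new version.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical log-entry layout: the checksum covers the header fields
 * before it plus the payload. */
typedef struct {
    uint64_t key;
    uint64_t len;
    uint64_t prev;       /* NVM offset of the previous version */
    uint64_t checksum;
    uint8_t  payload[];
} log_entry_t;

static uint64_t fnv1a(const void *buf, size_t n, uint64_t h) {
    const uint8_t *p = buf;
    for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 0x100000001b3ULL; }
    return h;
}

/* Reader-side atomicity check: a partially written entry fails this,
 * so no client-server coordination is needed. */
int entry_is_atomic(const log_entry_t *e) {
    uint64_t h = fnv1a(e, offsetof(log_entry_t, checksum),
                       0xcbf29ce484222325ULL);
    h = fnv1a(e->payload, e->len, h);
    return h == e->checksum;
}

/* 8-byte atomic publish: the bucket slot flips to the new entry's NVM
 * offset in one aligned store, so readers see old or new, never a mix. */
void publish(_Atomic uint64_t *bucket_slot, uint64_t new_entry_off) {
    atomic_store(bucket_slot, new_entry_off);
}
```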
{"title":"Write-Optimized and Consistent RDMA-based Non-Volatile Main Memory Systems","authors":"Xinxin Liu, Yu Hua, Xuan Li, Qifan Liu","doi":"10.1109/ICCD53106.2021.00048","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00048","url":null,"abstract":"To deliver high performance in cloud computing, many efforts leverage RDMA (Remote Direct Memory Access) in networking and NVMM (Non-Volatile Main Memory) in end systems. Due to no CPU involvement, one-sided RDMA becomes efficient to access the remote memory, and NVMM technologies have the strengths of non-volatility, byte-addressability and DRAM-like latency. However, due to the need to guarantee Remote Data Atomicity (RDA), the synergized scheme has to consume extra network round-trips, remote CPU participation and double NVMM writes. In order to address these problems, we propose a write-optimized log-structured NVMM design for Efficient Remote Data Atomicity, called Erda. In Erda, clients directly transfer data to the destination memory addresses in the logs on servers via one-sided RDMA writes without redundant copies and remote CPU consumption. To detect the atomicity of the fetched data, we verify a checksum without client-server coordination. We further ensure metadata consistency by leveraging an 8-byte atomic update in a hash table, which also contains the addresses of previous versions of data in the log for consistency. When a failure occurs, the server properly and efficiently restores to become consistent. Experimental results show that compared with state-of-the-art schemes, Erda reduces NVMM writes approximately by 50%, significantly improves throughput and decreases latency.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122456717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conciliating Speed and Efficiency on Cache Compressors
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00075
Daniel Rodrigues Carvalho, André Seznec
Cache compression algorithms must abide by hardware constraints; thus, their efficiency ends up being low, and most cache lines end up barely compressed. Moreover, schemes that compress relatively well often decompress slowly, and vice versa. This paper proposes a compression scheme that achieves a high compaction ratio and fast decompression. The key observation is that by further subdividing the chunks of data being compressed, one can tailor the algorithms. This concept is orthogonal to most existing compressors and reduces their average compressed size. In particular, we leverage this concept to boost a single-cycle-decompression compressor to a compressibility level competitive with state-of-the-art proposals. When normalized against the best long-decompression-latency state-of-the-art compressors, the proposed ideas further enhance the average cache capacity by 2.7% (geometric mean), while featuring short decompression latency.
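A toy illustration of the subdivision idea (the encoding table is an assumption, not the paper's compressor): instead of classifying a whole 8-byte word, classify each 4-byte sub-chunk separately, so a line mixing zero halves and small values still compresses even when no full word is zero or narrow.

```c
#include <stddef.h>
#include <stdint.h>

/* Encoded size in bytes of one 32-bit sub-chunk; the 2-bit class tag
 * is accounted for separately. */
static size_t encode_subchunk(uint32_t v) {
    if (v == 0)      return 0;  /* zero: tag only            */
    if (v <= 0xFF)   return 1;  /* narrow: 1 data byte       */
    if (v <= 0xFFFF) return 2;  /* half-width: 2 data bytes  */
    return 4;                   /* uncompressed              */
}

/* Compressed size of a 64-byte line split into 16 sub-chunks
 * (tag bits rounded up to whole bytes). */
size_t compressed_size(const uint32_t line[16]) {
    size_t tag_bits = 16 * 2, data_bytes = 0;
    for (int i = 0; i < 16; i++)
        data_bytes += encode_subchunk(line[i]);
    return (tag_bits + 7) / 8 + data_bytes;
}
```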
{"title":"Conciliating Speed and Efficiency on Cache Compressors","authors":"Daniel Rodrigues Carvalho, André Seznec","doi":"10.1109/ICCD53106.2021.00075","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00075","url":null,"abstract":"Cache compression algorithms must abide by hard-ware constraints; thus, their efficiency ends up being low, and most cache lines end up barely compressed. Moreover, schemes that compress relatively well often decompress slowly, and vice versa. This paper proposes a compression scheme achieving high (good) compaction ratio and fast decompression latency. The key observation is that by further subdividing the chunks of data being compressed one can tailor the algorithms. This concept is orthogonal to most existent compressors, and results in a reduction of their average compressed size. In particular, we leverage this concept to boost a single-cycle-decompression compressor to reach a compressibility level competitive to state-of-the-art proposals. When normalized against the best long decompression latency state-of-the-art compressors, the proposed ideas further enhance the average cache capacity by 2.7% (geometric mean), while featuring short decompression latency.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127799832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimal Transistor Placement Combined with Global In-cell Routing in Standard Cell Layout Synthesis
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00085
Kyeongrok Jo, Taewhan Kim
The synthesis of standard cell layouts is largely divided into two tasks, namely transistor placement and in-cell routing. Since the result of transistor placement strongly affects the quality of in-cell routing, it is crucial to accurately and efficiently predict in-cell routability during transistor placement. In this work, we address the problem of optimal transistor placement combined with global in-cell routing, with the primary objective of minimizing cell size and the secondary objective of minimizing wirelength for global in-cell routing. To this end, unlike the conventional indirect and complex SMT (satisfiability modulo theories) formulation, we propose a direct and efficient SMT formulation of the original problem. Experiments confirm that our method produces minimal-area cell layouts with minimal wirelength for global in-cell routing while spending much less running time than the conventional optimal layout generator.
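A generic flavor of such a formulation, as a hedged illustration rather than the paper's exact constraint set, assigns each transistor $i$ an integer column $x_i$ and encodes the two objectives lexicographically:

```latex
\begin{align*}
& \text{minimize } W && \text{(cell width, primary objective)} \\
& \text{then minimize } \textstyle\sum_{n \in \mathit{nets}} \mathrm{HPWL}(n)
  && \text{(global in-cell wirelength, secondary)} \\
& \text{subject to } 1 \le x_i \le W && \text{every transistor fits in the cell} \\
& \quad\;\; x_i \ne x_j && \text{for } i \ne j \text{ in the same row (no overlap)} \\
& \quad\;\; |x_i - x_j| = 1 \Rightarrow \mathrm{shared}(i,j)
  && \text{abutting transistors must share diffusion}
\end{align*}
```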
{"title":"Optimal Transistor Placement Combined with Global In-cell Routing in Standard Cell Layout Synthesis","authors":"Kyeongrok Jo, Taewhan Kim","doi":"10.1109/ICCD53106.2021.00085","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00085","url":null,"abstract":"The synthesis of standard cell layouts is largely divided into two tasks namely transistor placement and in-cell routing. Since the result of transistor placement highly affects the quality of in-cell routing, it is crucial to accurately and efficiently predict in-cell routability during transistor placement. In this work, we address the problem of an optimal transistor placement combined with global in-cell routing with the primary objective of minimizing cell size and the secondary objective of minimizing wirelength for global in-cell routing. To this end, unlike the conventional indirect and complex SMT (satisfiability modulo theory) formulation, we propose a method of direct and efficient formulation of the original problem based on SMT. Through experiments, it is confirmed that our proposed method is able to produce minimal-area cell layouts with minimal wirelength for global in-cell routing while spending much less running time over the conventional optimal layout generator.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115726047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CHARM: Collaborative Host and Accelerator Resource Management for GPU Datacenters
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00056
Wei Zhang, Kaihua Fu, Ningxin Zheng, Quan Chen, Chao Li, Wenli Zheng, M. Guo
Emerging latency-critical (LC) services often have both CPU and GPU stages (e.g., DNN-assisted services) and require short response latency. Co-locating best-effort (BE) applications with the LC service on both the CPU and GPU sides improves resource utilization. However, resource contention often results in QoS violations of LC services. We therefore present CHARM, a collaborative host-accelerator resource management system. CHARM ensures the required QoS target of DNN-assisted LC services while maximizing the resource utilization of both the host and the accelerator. CHARM comprises a BE-aware QoS target allocator, a unified heterogeneous resource manager, and a collaborative accelerator-side QoS compensator. The QoS target allocator determines the time limit of an LC service running on the host side and the accelerator side. The resource manager allocates the shared resources on both sides. The QoS compensator allocates more resources to the LC service to speed up its execution if it runs slower than expected. Experimental results on an Nvidia RTX 2080 Ti GPU show that CHARM improves resource utilization by 43.2% while ensuring the required QoS target, compared with state-of-the-art solutions.
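The allocator/compensator interplay might look like the following sketch. The names and the proportional policy are assumptions for illustration, not CHARM's published algorithm:

```c
/* Split the end-to-end QoS target between the CPU and GPU stages, then
 * grant the LC service more resource quota whenever a stage overruns
 * its share of the budget. */

typedef struct {
    double qos_target_ms;   /* end-to-end latency target            */
    double cpu_budget_ms;   /* time limit for the host-side stage   */
    double gpu_budget_ms;   /* time limit for the accelerator stage */
} budget_t;

/* Weight each stage's budget by its measured solo latency. */
budget_t split_budget(double qos_ms, double cpu_solo_ms, double gpu_solo_ms) {
    double total = cpu_solo_ms + gpu_solo_ms;
    budget_t b = { qos_ms,
                   qos_ms * cpu_solo_ms / total,
                   qos_ms * gpu_solo_ms / total };
    return b;
}

/* Compensator: if the measured stage latency exceeds its budget, take
 * quota (cores, SM partitions, ...) back from BE tasks, proportionally
 * to the overrun. */
int extra_quota_needed(double measured_ms, double budget_ms, int cur_quota) {
    if (measured_ms <= budget_ms) return 0;
    double overrun = measured_ms / budget_ms;      /* 1.3 = 30% too slow */
    return (int)(cur_quota * (overrun - 1.0) + 1);
}
```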
{"title":"CHARM: Collaborative Host and Accelerator Resource Management for GPU Datacenters","authors":"Wei Zhang, Kaihua Fu, Ningxin Zheng, Quan Chen, Chao Li, Wenli Zheng, M. Guo","doi":"10.1109/ICCD53106.2021.00056","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00056","url":null,"abstract":"Emerging latency-critical (LC) services often have both CPU and GPU stages (e.g. DNN-assisted services) and require short response latency. Co-locating best-effort (BE) applications on the both CPU side and GPU side with the LC service improves resource utilization. However, resource contention often results in the QoS violation of LC services. We therefore present CHARM, a collaborative host-accelerator resource management system. CHARM ensures the required QoS target of DNN-assisted LC services, while maximizing the resource utilization of both the host and accelerator. CHARM is comprised of a BE-aware QoS target allocator, a unified heterogeneous resource manager, and a collaborative accelerator-side QoS compensator. The QoS target allocator determines the time limit of an LC service running on the host side and the accelerator side. The resource manager allocates the shared resources on both host side and accelerator side. The QoS compensator allocates more resources to the LC service to speed up its execution, if it runs slower than expected. Experimental results on an Nvidia GPU RTX 2080Ti show that CHARM improves the resource utilization by 43.2%, while ensuring the required QoS target compared with state-of-the-art solutions.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114906518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HASDH: A Hotspot-Aware and Scalable Dynamic Hashing for Hybrid DRAM-NVM Memory
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00034
Z. Li, Zhipeng Tan, Jianxi Chen
Intel Optane DC Persistent Memory Module (DCPMM) is the first commercially available non-volatile memory (NVM) product and can be placed directly on the processor's memory bus along with DRAM to serve as a hybrid memory. Compared with DRAM, NVM has roughly 3× the read latency and similar write latency, while its read and write bandwidths are only one-third and one-sixth of those of DRAM, respectively. However, existing hashing schemes fail to account for these performance characteristics. We propose HASDH, a hotspot-aware and scalable dynamic hashing scheme built on hybrid DRAM-NVM memory. HASDH maintains structure metadata (i.e., the directory) in DRAM and persists key-value items in NVM. To reduce the access cost of hot key-value items, HASDH caches frequently accessed key-value items in DRAM with a dedicated caching strategy. To achieve scalable performance on multicore machines, HASDH maintains locks in DRAM, avoiding the extra NVM read-write bandwidth that lock operations would otherwise consume. Furthermore, HASDH chains all NVM segments using sibling pointers to their right neighbors to ensure crash consistency, and leverages log-free NVM segment splits to reduce logging overhead. On an 18-core machine with Intel Optane DCPMM, experimental results show that HASDH achieves 1.43~7.39× speedup for insertions, 2.08~9.63× speedup for searches, and 1.78~3.01× speedup for deletions, compared with state-of-the-art NVM-based hashing indexes.
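Structurally, the DRAM-directory/NVM-segment split and the log-free split publish could be sketched as below. The types and sizes are assumptions, not HASDH's actual layout:

```c
#include <stdatomic.h>
#include <stdint.h>

#define SEG_SLOTS 256

/* Segments hold the persisted key-value items in NVM; the sibling
 * pointer chains each segment to its right neighbor for recovery. */
typedef struct segment {
    uint64_t keys[SEG_SLOTS];
    uint64_t vals[SEG_SLOTS];
    struct segment *sibling;
} segment_t;

/* The directory is volatile DRAM metadata: an array of pointers to
 * NVM-resident segments. */
typedef struct {
    _Atomic(segment_t *) slot[1024];
} directory_t;

/* Log-free split: build the new segment in NVM, persist it fully, then
 * swing the directory entry with one 8-byte atomic store. Readers
 * racing with the split see either the old or the new segment, both
 * self-consistent, so no undo/redo logging is needed. */
void publish_split(directory_t *dir, unsigned idx, segment_t *new_seg) {
    /* ... new_seg fully written and flushed to NVM at this point ... */
    atomic_store(&dir->slot[idx], new_seg);
}
```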
{"title":"HASDH: A Hotspot-Aware and Scalable Dynamic Hashing for Hybrid DRAM-NVM Memory","authors":"Z. Li, Zhipeng Tan, Jianxi Chen","doi":"10.1109/ICCD53106.2021.00034","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00034","url":null,"abstract":"Intel Optane DC Persistent Memory Module (DCPMM) is the first commercially available non-volatile memory (NVM) product and can be directly placed on the processor’s memory bus along with DRAM to serve as a hybrid memory. Compared with DRAM, NVM has 3× read latency and similar write latency, while the read and write bandwidths of NVM are only 1/3rd and 1/6th of those of DRAM. However, existing hashing schemes fail to reap those performance characteristics. We propose HASDH, a hotspot-aware and scalable dynamic hashing built on the hybrid DRAM-NVM memory. HASDH maintains structure metadata (i.e., directory) in DRAM and persists key-value items in NVM. To reduce hot key-value items’ access cost, HASDH caches frequently-accessed key-value items in DRAM with a dedicated caching strategy. To achieve scalable performance for multicore machines, HASDH maintains locks in DRAM that avoid the extra NVM read-write bandwidth consumption caused by lock operations. Furthermore, HASDH chains all NVM segments using sibling pointers to the right neighbors to ensure crash consistency and leverages log-free NVM segment split to reduce logging overhead. On an 18-core machine with Intel Optane DCPMM, experimental results show that HASDH achieves 1.43∼7.39× speedup for insertions, 2.08~9.63× speedup for searches, and 1.78~3.01× speedup for deletions, compared with start-of-the-art NVM-based hashing indexes.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122179364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low-Cost Sequential Logic Circuit Design Considering Single Event Double-Node Upsets and Single Event Transients
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00038
R. Rajaei, M. Niemier, X. Hu
As CMOS device sizes continue to scale down, radiation-related reliability issues are of ever-growing concern. Single event double-node upsets (SEDUs) in sequential logic and single event transients (SETs) in combinational logic are sources of high-rate radiation-induced soft errors that can affect the functionality of logic circuits. This paper presents effective circuit-level solutions for combating SEDUs and SETs in nanoscale sequential and combinational logic circuits. More specifically, we propose and evaluate low-power latch and flip-flop circuits that mitigate SEDUs and SETs. Simulations with a 22 nm PTM model show that the proposed circuits offer full immunity against SEDUs, filter SET pulses better, and simultaneously reduce design overhead compared with prior work. As a representative example, simulation-based studies show that our designs offer up to 77% improvement in delay-power-area product and can filter out up to 58% wider SET pulses than the state-of-the-art.
{"title":"Low-Cost Sequential Logic Circuit Design Considering Single Event Double-Node Upsets and Single Event Transients","authors":"R. Rajaei, M. Niemier, X. Hu","doi":"10.1109/ICCD53106.2021.00038","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00038","url":null,"abstract":"As CMOS device sizes continue to scale down, radiation-related reliability issues are of ever-growing concern. Single event double node upsets (SEDUs) in sequential logic and single event transients (SETs) in combinational logic are sources of high rate radiation-induced soft errors that can affect the functionality of logic circuits. This paper presents effective circuit-level solutions for combating SEDUs/SETs in nanoscale sequential and combinational logic circuits. More specifically, we propose and evaluate low-power latch and flip-flop circuits to mitigate SEDUs and SETs. Simulations with a 22 nm PTM model reveal that the proposed circuits offer full immunity against SEDUs, can better filter SET pulses, and simultaneously reduce design overhead when compared to prior work. As a representative example, simulation-based studies show that our designs offer up to 77% improvements in delay-power-area product, and can filter out up to 58% wider SET pulses when compared to the state-of-the-art.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126481511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}