Differential Testing of x86 Instruction Decoders with Instruction Operand Inferring Algorithm
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00040
Guang Wang, Ziyuan Zhu, Shuan Li, Xu Cheng, Dan Meng
Instruction decoders are tools for software analysis, sandboxing, malware detection, and undocumented instruction detection. The decoders must be accurate and consistent with the instruction set architecture manuals. Existing testing methods for instruction decoders are based on random generation and instruction structure mutation, and they mainly target the legal instruction space. However, there is little research on whether instructions in the reserved instruction space are accurately identified as invalid. We propose an instruction operand inferring algorithm, based on depth-first search, that skips a large amount of redundant legal instruction space. The algorithm keeps the types of instructions in the legal instruction space unchanged and guarantees traversal of the reserved instruction space. In addition, we propose a differential testing method that discovers decoding discrepancies between instruction decoders. We applied the method to XED and Capstone and found four million inconsistent instructions between them. Compared with the existing instruction generation method based on depth-first search, our method improves efficiency by about four times.
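The differential-testing idea itself is straightforward to express in code. Below is a minimal, hedged sketch of such a harness in Python: the two decoder adapters (`decode_with_xed`, `decode_with_capstone`) are hypothetical stand-ins for real bindings to XED and Capstone, and the candidate byte sequences are arbitrary; the paper's operand-inferring depth-first search that generates candidates is not reproduced here.

```python
# Minimal sketch of a differential-testing harness for two x86 decoders.
# decode_with_xed / decode_with_capstone are hypothetical adapters; a real
# harness would call the XED and Capstone libraries and normalize their output.
from typing import Callable, Dict, List, Optional, Tuple

def decode_with_xed(raw: bytes) -> Optional[str]:
    # Placeholder so the sketch runs: treat 0x90 as NOP, everything else as invalid.
    return "nop" if raw[:1] == b"\x90" else None

def decode_with_capstone(raw: bytes) -> Optional[str]:
    # Placeholder adapter; a real one would use Capstone's disassembler bindings.
    return "nop" if raw[:1] == b"\x90" else None

def differential_test(
    candidates: List[bytes],
    decoders: Dict[str, Callable[[bytes], Optional[str]]],
) -> List[Tuple[str, Dict[str, Optional[str]]]]:
    """Collect byte sequences on which the decoders disagree (validity or decoded form)."""
    discrepancies = []
    for raw in candidates:
        results = {name: dec(raw) for name, dec in decoders.items()}
        if len(set(results.values())) > 1:  # at least two decoders disagree
            discrepancies.append((raw.hex(), results))
    return discrepancies

if __name__ == "__main__":
    tests = [b"\x90", b"\x0f\x0b", b"\xf1"]  # arbitrary candidate encodings
    for raw_hex, results in differential_test(
        tests, {"xed": decode_with_xed, "capstone": decode_with_capstone}
    ):
        print(raw_hex, results)
```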
{"title":"Differential Testing of x86 Instruction Decoders with Instruction Operand Inferring Algorithm","authors":"Guang Wang, Ziyuan Zhu, Shuan Li, Xu Cheng, Dan Meng","doi":"10.1109/ICCD53106.2021.00040","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00040","url":null,"abstract":"The instruction decoders are tools for software analysis, sandboxing, malware detection, and undocumented instructions detection. The decoders must be accurate and consistent with the instruction set architecture manuals. The existing testing methods for instruction decoders are based on random and instruction structure mutation. Moreover, the methods are mainly aimed at the legal instruction space. However, there is little research on whether the instructions in the reserved instruction space can be accurately identified as invalid instructions. We propose an instruction operand inferring algorithm, based on the depth-first search algorithm, to skip considerable redundant legal instruction space. The algorithm keeps the types of instructions in the legal instruction space unchanged and guarantees the traversal of the reserved instruction space. In addition, we propose a differential testing method that discovers decoding discrepancies between instruction decoders. We applied the method to XED and Capstone and found four million inconsistent instructions between them. Compared with the existing instruction generation method based on the depth-first search algorithm, the efficiency of our method is improved by about four times.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128724085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
POMI: Polling-Based Memory Interface for Hybrid Memory System
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00076
Trung Le, Zhao Zhang, Zhichun Zhu
Modern conventional DRAM main memory systems can no longer satisfy the growing capacity and bandwidth demands of today’s data-intensive applications. Non-Volatile Memory (NVM) has been extensively researched as an alternative to DRAM-based systems due to its higher density and non-volatility. A hybrid memory system benefits from both DRAM and NVM technologies; however, a traditional Memory Controller (MC) cannot efficiently track and schedule operations for all the memory devices in a heterogeneous system because of the different timing requirements and complex architectural support of the various memory technologies. To address this issue, we propose a hybrid memory architecture framework called POMI. It uses a small buffer chip inserted on each DIMM to decouple operation scheduling from the controller, enabling support for diverse memory technologies in the system. Unlike the conventional DRAM-based system, which relies on the main MC to govern all DIMMs, POMI uses a polling-based memory bus protocol for communication and to resolve bus conflicts between memory modules. The buffer chip on each DIMM provides feedback information to the main MC so that the polling overhead is trivial. This brings several benefits: a technology-independent memory system, higher parallelism, and better scalability. Our experimental results with octa-core workloads show that POMI can efficiently support heterogeneous systems and outperforms an existing interface for hybrid memory systems by 22.0% on average for memory-intensive workloads.
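As a rough illustration of the polling idea (not the paper's actual protocol or timing model), the sketch below models a memory controller that round-robin polls per-DIMM buffer chips; each buffer chip tracks its own device timing and reports completed requests when polled. All class and method names are invented for the example.

```python
# Toy model of a polling-based interface: the MC issues requests to buffer chips
# and polls them for completions; each buffer chip handles device-specific timing.
from collections import deque

class BufferChip:
    def __init__(self, name: str, service_cycles: int):
        self.name = name
        self.service_cycles = service_cycles  # stand-in for DRAM/NVM timing differences
        self.queue = deque()                  # entries: [request_id, cycles_remaining]
        self.done = []

    def enqueue(self, request_id: int) -> None:
        self.queue.append([request_id, self.service_cycles])

    def tick(self) -> None:
        # The buffer chip, not the MC, tracks per-device timing.
        if self.queue:
            self.queue[0][1] -= 1
            if self.queue[0][1] == 0:
                self.done.append(self.queue.popleft()[0])

    def poll(self):
        finished, self.done = self.done, []
        return finished

class MemoryController:
    def __init__(self, dimms):
        self.dimms = dimms

    def run(self, requests, cycles: int):
        completed = []
        for rid, dimm_idx in requests:
            self.dimms[dimm_idx].enqueue(rid)
        for _ in range(cycles):
            for dimm in self.dimms:           # round-robin polling of buffer chips
                dimm.tick()
                completed.extend(dimm.poll())
        return completed

if __name__ == "__main__":
    mc = MemoryController([BufferChip("DRAM-DIMM", 3), BufferChip("NVM-DIMM", 9)])
    print(mc.run([(0, 0), (1, 1), (2, 0)], cycles=20))
```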
{"title":"POMI: Polling-Based Memory Interface for Hybrid Memory System","authors":"Trung Le, Zhao Zhang, Zhichun Zhu","doi":"10.1109/ICCD53106.2021.00076","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00076","url":null,"abstract":"Modern conventional DRAM main memory system will no longer satisfy the growing demand for capacity and bandwidth on today’s data-intensive applications. Non-Volatile Memory (NVM) has been extensively researched as the alter-native for DRAM-based system due to its higher density and non-volatile characteristics. Hybrid memory system benefits from both DRAM and NVM technologies, however traditional Memory Controller (MC) cannot efficiently track and schedule operations for all the memory devices in heterogeneous systems due to different timing requirements and complex architecture supports of various memory technologies. To address this issue, we propose a hybrid memory architecture framework called POMI. It uses a small buffer chip inserted on each DIMM to decouple operation scheduling from the controller to enable the support for diverse memory technologies in the system. Unlike the conventional DRAM-based system, which relies on the main MC to govern all DIMMs, POMI uses polling-based memory bus protocol for communication and to resolve any bus conflicts between memory modules. The buffer chip on each DIMM will provide feedback information to the main MC so that the polling overhead is trivial. This gives several benefits: technology-independent memory system, higher parallelism, and better scalability. Our experimental results using octa-core workloads show that POMI can efficiently support heterogeneous systems and it outperforms an existing interface for hybrid memory systems by 22.0% on average for memory-intensive workloads.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"179 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133725373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Special Session: When Dataflows Converge: Reconfigurable and Approximate Computing for Emerging Neural Networks
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00014
Di Wu, Joshua San Miguel
Deep Neural Networks (DNNs) have gained significant attention in both academia and industry due to their superior application-level accuracy. Because DNNs rely on compute- or memory-intensive general matrix multiply (GEMM) operations, approximate computing has been widely explored across the computing stack to mitigate the hardware overheads. However, better-performing DNNs are emerging with growing complexity in their use of nonlinear operations, which incurs even more hardware cost. In this work, we address this challenge by proposing a reconfigurable systolic array that executes both GEMM and nonlinear operations via approximation with distinct dataflows. Experiments demonstrate that converging these dataflows significantly reduces the hardware cost of emerging DNN inference.
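One common way to make a nonlinear operation fit a GEMM-style array, shown in the hedged sketch below, is a piecewise-linear approximation: each evaluation becomes a single multiply-accumulate (y = a·x + b), which is exactly the operation a systolic array already performs. This is a generic illustration, not the paper's dataflow; the breakpoints and the choice of tanh are arbitrary.

```python
# Sketch: approximating a nonlinear operation (here tanh) with a piecewise-linear
# table, so each evaluation reduces to one multiply-accumulate per activation.
import math

BREAKS = [-3.0, -1.5, -0.5, 0.5, 1.5, 3.0]

def _build_segments():
    segs = []
    for lo, hi in zip(BREAKS[:-1], BREAKS[1:]):
        a = (math.tanh(hi) - math.tanh(lo)) / (hi - lo)   # segment slope
        b = math.tanh(lo) - a * lo                        # segment intercept
        segs.append((lo, hi, a, b))
    return segs

SEGMENTS = _build_segments()

def pwl_tanh(x: float) -> float:
    if x <= BREAKS[0]:
        return -1.0                   # saturate below the first breakpoint
    if x >= BREAKS[-1]:
        return 1.0                    # saturate above the last breakpoint
    for lo, hi, a, b in SEGMENTS:
        if lo <= x < hi:
            return a * x + b          # one MAC, same shape as a GEMM inner step
    return math.tanh(x)               # unreachable fallback

if __name__ == "__main__":
    for x in (-2.0, -0.3, 0.0, 0.7, 2.5):
        print(f"x={x:+.1f}  pwl={pwl_tanh(x):+.4f}  exact={math.tanh(x):+.4f}")
```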
{"title":"Special Session: When Dataflows Converge: Reconfigurable and Approximate Computing for Emerging Neural Networks","authors":"Di Wu, Joshua San Miguel","doi":"10.1109/ICCD53106.2021.00014","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00014","url":null,"abstract":"Deep Neural Networks (DNNs) have gained significant attention in both academia and industry due to the superior application-level accuracy. As DNNs rely on compute- or memory-intensive general matrix multiply (GEMM) operations, approximate computing has been widely explored across the computing stack to mitigate the hardware overheads. However, better-performing DNNs are emerging with growing complexity in their use of nonlinear operations, which incurs even more hardware cost. In this work, we address this challenge by proposing a reconfigurable systolic array to execute both GEMM and nonlinear operations via approximation with distinguished dataflows. Experiments demonstrate that such converging of dataflows significantly saves the hardware cost of emerging DNN inference.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131663301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NRHI: A Concurrent Non-Rehashing Hash Index for Persistent Memory
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00033
Xinyu Li, Huimin Cui, Lei Liu
Persistent memory (PM), featuring data persistence, byte-addressability, and DRAM-like performance, has become commercially available with the advent of Intel® Optane™ DC persistent memory. The DRAM-like performance and disk-like persistence invite shifting hashing-based index schemes, which are important building blocks of today’s internet service infrastructures providing fast queries, from DRAM onto persistent memory. Numerous hash indexes for persistent memory have been proposed to optimize writes and crash consistency, but they scale poorly under resizing. Generally, resizing consists of allocating a new hash table and rehashing items from the old table into the new one. We argue that resizing with rehashing, performed in either a blocking or a non-blocking way, can degrade overall performance and limit scalability. To mitigate this limitation, this paper proposes a Non-Rehashing Hash Index (NRHI) scheme that performs resizing without rehashing items. NRHI leverages a layered structure to link hash tables without moving key-value pairs across layers, thus reducing the time spent on blocking rehashing and alleviating the slot contention that occurs in non-blocking rehashing. Furthermore, the compare-and-swap primitive is utilized to support concurrent lock-free hashing operations. Experimental results on real PM hardware show that NRHI outperforms state-of-the-art PM hash indexes by 1.7× to 3.59× and scales linearly with the number of threads.
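A simplified way to picture the non-rehashing layered structure (ignoring persistence, CAS, and lock-freedom) is sketched below: resizing appends a new, larger table layer, lookups probe layers from newest to oldest, and existing key-value pairs are never moved. The probing scheme and growth factor are illustrative assumptions, not NRHI's actual design.

```python
# Sketch of a layered, non-rehashing hash index: resizing adds a layer instead of
# rehashing; no key-value pair ever moves across layers. Persistence, CAS-based
# lock-free updates, and NVM allocation from the paper are intentionally omitted.
class LayeredHashIndex:
    def __init__(self, initial_buckets: int = 4):
        self.layers = [[None] * initial_buckets]   # newest layer is last

    def _probe(self, layer, key):
        size = len(layer)
        idx = hash(key) % size
        for step in range(size):                   # linear probing within one layer
            pos = (idx + step) % size
            slot = layer[pos]
            if slot is None or slot[0] == key:
                return pos, slot
        return None, None                          # layer is full

    def get(self, key):
        for layer in reversed(self.layers):        # newest layer first
            _, slot = self._probe(layer, key)
            if slot is not None and slot[0] == key:
                return slot[1]
        return None

    def put(self, key, value):
        layer = self.layers[-1]
        idx, _ = self._probe(layer, key)
        if idx is None:                            # resize: link a bigger layer, no rehash
            self.layers.append([None] * (2 * len(layer)))
            layer = self.layers[-1]
            idx, _ = self._probe(layer, key)
        layer[idx] = (key, value)

if __name__ == "__main__":
    index = LayeredHashIndex(initial_buckets=2)
    for i in range(10):
        index.put(f"k{i}", i)
    print(index.get("k7"), "layers:", len(index.layers))
```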
{"title":"NRHI: A Concurrent Non-Rehashing Hash Index for Persistent Memory","authors":"Xinyu Li, Huimin Cui, Lei Liu","doi":"10.1109/ICCD53106.2021.00033","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00033","url":null,"abstract":"Persistent memory (PM) featured with data persistence, byte-addressability, and DRAM-like performance has been commercially available with the advent of Intel® Optane™ DC persistent memory. The DRAM-like performance and disk-like persistence invite shifting hashing-based index schemes, which are important building blocks of today’s internet service infrastructures to provide fast queries, from DRAM onto persistent memory. Numerous hash indexes for persistent memory have been proposed to optimize writes and crash consistency, but with poor scalability under resizing. Generally, resizing consists of allocating a new hash table and rehashing items from the old table into the new one. We argue that resizing with rehashing performed in either blocking or non-blocking way can degrade the overall performance and limit the scalability.In order to mitigate the limitation of resizing, this paper proposes a Non-Rehashing Hash Index (NRHI) scheme to perform resizing with no necessity of rehashing items. NRHI leverages a layered structure to link hash tables without moving key-value pairs across layers, thus reducing the time spent on rehashing in blocking way and alleviating slots contention occurred in non-blocking way. Furthermore, the compare-and-swap primitive is utilized to support concurrent lock-free hashing operations. Experimental results on real PM hardware show that NRHI outperforms the state-of-the-art PM hash indexes by 1.7× to 3.59×, and scales linearly with the number of threads.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"325 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133504170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Write-Optimized and Consistent RDMA-based Non-Volatile Main Memory Systems
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00048
Xinxin Liu, Yu Hua, Xuan Li, Qifan Liu
To deliver high performance in cloud computing, many efforts leverage RDMA (Remote Direct Memory Access) in the network and NVMM (Non-Volatile Main Memory) in end systems. Because it involves no remote CPU, one-sided RDMA is efficient for accessing remote memory, and NVMM technologies offer non-volatility, byte-addressability, and DRAM-like latency. However, to guarantee Remote Data Atomicity (RDA), the combined scheme has to consume extra network round-trips, remote CPU participation, and double NVMM writes. To address these problems, we propose a write-optimized log-structured NVMM design for Efficient Remote Data Atomicity, called Erda. In Erda, clients directly transfer data to the destination memory addresses in the logs on servers via one-sided RDMA writes, without redundant copies or remote CPU consumption. To detect the atomicity of the fetched data, we verify a checksum without client-server coordination. We further ensure metadata consistency by leveraging an 8-byte atomic update in a hash table, which also contains the addresses of previous versions of the data in the log for consistency. When a failure occurs, the server restores itself to a consistent state properly and efficiently. Experimental results show that, compared with state-of-the-art schemes, Erda reduces NVMM writes by approximately 50%, significantly improves throughput, and decreases latency.
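The checksum-based atomicity check can be illustrated with a self-describing log entry: the writer appends a checksum computed over the payload, and the reader recomputes it to decide whether the fetched bytes form a complete write. The entry layout below is invented for illustration and is not Erda's actual on-NVMM format.

```python
# Sketch: a checksummed log entry whose integrity can be verified by the reader
# alone (no client-server coordination). Layout: 4-byte key length, 4-byte value
# length, key bytes, value bytes, 4-byte CRC32 over everything before it.
import struct
import zlib

def encode_entry(key: bytes, value: bytes) -> bytes:
    payload = struct.pack("<II", len(key), len(value)) + key + value
    return payload + struct.pack("<I", zlib.crc32(payload))

def decode_entry(raw: bytes):
    """Return (key, value) if the entry is complete and consistent, else None."""
    if len(raw) < 12:
        return None
    key_len, val_len = struct.unpack_from("<II", raw, 0)
    total = 8 + key_len + val_len + 4
    if len(raw) < total:
        return None                                   # torn / partial write
    payload = raw[: total - 4]
    (crc,) = struct.unpack_from("<I", raw, total - 4)
    if zlib.crc32(payload) != crc:
        return None                                   # corrupted or incomplete write
    return raw[8 : 8 + key_len], raw[8 + key_len : 8 + key_len + val_len]

if __name__ == "__main__":
    entry = encode_entry(b"user:42", b'{"name": "alice"}')
    print(decode_entry(entry))            # complete entry -> (key, value)
    print(decode_entry(entry[:-3]))       # truncated entry -> None
```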
{"title":"Write-Optimized and Consistent RDMA-based Non-Volatile Main Memory Systems","authors":"Xinxin Liu, Yu Hua, Xuan Li, Qifan Liu","doi":"10.1109/ICCD53106.2021.00048","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00048","url":null,"abstract":"To deliver high performance in cloud computing, many efforts leverage RDMA (Remote Direct Memory Access) in networking and NVMM (Non-Volatile Main Memory) in end systems. Due to no CPU involvement, one-sided RDMA becomes efficient to access the remote memory, and NVMM technologies have the strengths of non-volatility, byte-addressability and DRAM-like latency. However, due to the need to guarantee Remote Data Atomicity (RDA), the synergized scheme has to consume extra network round-trips, remote CPU participation and double NVMM writes. In order to address these problems, we propose a write-optimized log-structured NVMM design for Efficient Remote Data Atomicity, called Erda. In Erda, clients directly transfer data to the destination memory addresses in the logs on servers via one-sided RDMA writes without redundant copies and remote CPU consumption. To detect the atomicity of the fetched data, we verify a checksum without client-server coordination. We further ensure metadata consistency by leveraging an 8-byte atomic update in a hash table, which also contains the addresses of previous versions of data in the log for consistency. When a failure occurs, the server properly and efficiently restores to become consistent. Experimental results show that compared with state-of-the-art schemes, Erda reduces NVMM writes approximately by 50%, significantly improves throughput and decreases latency.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122456717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conciliating Speed and Efficiency on Cache Compressors
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00075
Daniel Rodrigues Carvalho, André Seznec
Cache compression algorithms must abide by hardware constraints; thus, their efficiency ends up being low, and most cache lines end up barely compressed. Moreover, schemes that compress relatively well often decompress slowly, and vice versa. This paper proposes a compression scheme achieving a high compaction ratio and fast decompression latency. The key observation is that by further subdividing the chunks of data being compressed, one can tailor the algorithms. This concept is orthogonal to most existing compressors and reduces their average compressed size. In particular, we leverage this concept to boost a single-cycle-decompression compressor to a compressibility level competitive with state-of-the-art proposals. When normalized against the best long-decompression-latency state-of-the-art compressors, the proposed ideas further enhance the average cache capacity by 2.7% (geometric mean) while featuring short decompression latency.
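The "subdivide the chunks" observation can be pictured with a toy base-plus-delta size estimator: treating a 64-byte line as eight 8-byte chunks often yields large deltas, while splitting it into 4-byte chunks lets small deltas be stored in single bytes. The encoder below is a deliberately simplified stand-in, not the paper's compressor, and it ignores the metadata a real hardware scheme would need.

```python
# Toy base+delta size estimator at two chunk granularities, to show why further
# subdividing chunks can reduce compressed size. Sizes are in bytes.
import struct

def base_delta_size(line: bytes, chunk_bytes: int) -> int:
    """Compressed size if each chunk is stored as a delta from the first chunk."""
    chunks = [int.from_bytes(line[i:i + chunk_bytes], "little")
              for i in range(0, len(line), chunk_bytes)]
    base = chunks[0]
    size = chunk_bytes                                  # the base is stored verbatim
    for c in chunks[1:]:
        delta = abs(c - base)
        size += max(1, (delta.bit_length() + 7) // 8)   # bytes needed for each delta
    return size

if __name__ == "__main__":
    # A 64-byte line holding sixteen nearby 32-bit integers (e.g., offsets into one page).
    values = [0x10000 + 4 * i for i in range(16)]
    line = b"".join(struct.pack("<I", v) for v in values)
    print("8-byte chunks :", base_delta_size(line, 8), "bytes")
    print("4-byte chunks :", base_delta_size(line, 4), "bytes")
```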
{"title":"Conciliating Speed and Efficiency on Cache Compressors","authors":"Daniel Rodrigues Carvalho, André Seznec","doi":"10.1109/ICCD53106.2021.00075","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00075","url":null,"abstract":"Cache compression algorithms must abide by hard-ware constraints; thus, their efficiency ends up being low, and most cache lines end up barely compressed. Moreover, schemes that compress relatively well often decompress slowly, and vice versa. This paper proposes a compression scheme achieving high (good) compaction ratio and fast decompression latency. The key observation is that by further subdividing the chunks of data being compressed one can tailor the algorithms. This concept is orthogonal to most existent compressors, and results in a reduction of their average compressed size. In particular, we leverage this concept to boost a single-cycle-decompression compressor to reach a compressibility level competitive to state-of-the-art proposals. When normalized against the best long decompression latency state-of-the-art compressors, the proposed ideas further enhance the average cache capacity by 2.7% (geometric mean), while featuring short decompression latency.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127799832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimal Transistor Placement Combined with Global In-cell Routing in Standard Cell Layout Synthesis
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00085
Kyeongrok Jo, Taewhan Kim
The synthesis of standard cell layouts is largely divided into two tasks, namely transistor placement and in-cell routing. Since the result of transistor placement strongly affects the quality of in-cell routing, it is crucial to accurately and efficiently predict in-cell routability during transistor placement. In this work, we address the problem of optimal transistor placement combined with global in-cell routing, with the primary objective of minimizing cell size and the secondary objective of minimizing the wirelength of global in-cell routing. To this end, unlike the conventional indirect and complex SMT (satisfiability modulo theories) formulation, we propose a direct and efficient SMT formulation of the original problem. Experiments confirm that our proposed method produces minimal-area cell layouts with minimal wirelength for global in-cell routing while spending much less running time than the conventional optimal layout generator.
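For readers unfamiliar with SMT-based layout formulations, the toy optimization instance below (using the z3 Python bindings, assuming the `z3-solver` package is installed) shows the general flavor: integer placement variables, pairwise non-overlap disjunctions, and cell width as the minimization objective. It is a generic illustration with made-up transistor widths, not the paper's formulation, which additionally encodes global in-cell routability and wirelength.

```python
# Hedged sketch: a toy SMT/OMT placement instance — place transistors on a row
# without overlap while minimizing the cell width.
from z3 import Int, Or, Optimize, sat

widths = {"M1": 2, "M2": 3, "M3": 2}            # hypothetical transistor widths
x = {name: Int(f"x_{name}") for name in widths}  # left edge of each transistor
cell_width = Int("cell_width")

opt = Optimize()
for name, w in widths.items():
    opt.add(x[name] >= 0, x[name] + w <= cell_width)   # stay inside the cell

names = list(widths)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        a, b = names[i], names[j]
        # non-overlap: a entirely left of b, or b entirely left of a
        opt.add(Or(x[a] + widths[a] <= x[b], x[b] + widths[b] <= x[a]))

opt.minimize(cell_width)                         # primary objective: minimal cell size
if opt.check() == sat:
    m = opt.model()
    print({n: m[x[n]] for n in names}, "width =", m[cell_width])
```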
{"title":"Optimal Transistor Placement Combined with Global In-cell Routing in Standard Cell Layout Synthesis","authors":"Kyeongrok Jo, Taewhan Kim","doi":"10.1109/ICCD53106.2021.00085","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00085","url":null,"abstract":"The synthesis of standard cell layouts is largely divided into two tasks namely transistor placement and in-cell routing. Since the result of transistor placement highly affects the quality of in-cell routing, it is crucial to accurately and efficiently predict in-cell routability during transistor placement. In this work, we address the problem of an optimal transistor placement combined with global in-cell routing with the primary objective of minimizing cell size and the secondary objective of minimizing wirelength for global in-cell routing. To this end, unlike the conventional indirect and complex SMT (satisfiability modulo theory) formulation, we propose a method of direct and efficient formulation of the original problem based on SMT. Through experiments, it is confirmed that our proposed method is able to produce minimal-area cell layouts with minimal wirelength for global in-cell routing while spending much less running time over the conventional optimal layout generator.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115726047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CHARM: Collaborative Host and Accelerator Resource Management for GPU Datacenters
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00056
Wei Zhang, Kaihua Fu, Ningxin Zheng, Quan Chen, Chao Li, Wenli Zheng, M. Guo
Emerging latency-critical (LC) services often have both CPU and GPU stages (e.g., DNN-assisted services) and require short response latency. Co-locating best-effort (BE) applications with the LC service on both the CPU and GPU sides improves resource utilization. However, resource contention often results in QoS violations of LC services. We therefore present CHARM, a collaborative host-accelerator resource management system. CHARM ensures the required QoS target of DNN-assisted LC services while maximizing the resource utilization of both the host and the accelerator. CHARM comprises a BE-aware QoS target allocator, a unified heterogeneous resource manager, and a collaborative accelerator-side QoS compensator. The QoS target allocator determines the time limit of an LC service running on the host side and the accelerator side. The resource manager allocates the shared resources on both sides. The QoS compensator allocates more resources to the LC service to speed up its execution if it runs slower than expected. Experimental results on an Nvidia RTX 2080Ti GPU show that CHARM improves resource utilization by 43.2% while ensuring the required QoS target, compared with state-of-the-art solutions.
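A rough sketch of the compensation loop described above: the end-to-end QoS target is split into per-stage time limits, measured stage latencies are compared against those limits, and the stage running behind is granted additional resource quota at the expense of co-located BE work. The proportional adjustment policy, constants, and names are illustrative assumptions, not CHARM's actual mechanism.

```python
# Sketch of a host/accelerator QoS compensator: split the latency target across
# the CPU and GPU stages, then shift resource quota toward whichever stage is
# running slower than its share. Policy and constants are invented for the example.
def split_target(total_target_ms: float, cpu_fraction: float):
    return total_target_ms * cpu_fraction, total_target_ms * (1.0 - cpu_fraction)

def compensate(quota, measured_ms, limits_ms, step: float = 0.05):
    """quota: dict with 'cpu' and 'gpu' resource shares in [0, 1] given to the LC service."""
    new_quota = dict(quota)
    for stage in ("cpu", "gpu"):
        if measured_ms[stage] > limits_ms[stage]:
            # LC stage missed its share of the target: take resources back from BE tasks.
            new_quota[stage] = min(1.0, new_quota[stage] + step)
        elif measured_ms[stage] < 0.8 * limits_ms[stage]:
            # Comfortable slack: release resources so co-located BE work can use them.
            new_quota[stage] = max(0.1, new_quota[stage] - step)
    return new_quota

if __name__ == "__main__":
    cpu_limit, gpu_limit = split_target(total_target_ms=20.0, cpu_fraction=0.3)
    quota = {"cpu": 0.5, "gpu": 0.5}
    measured = {"cpu": 4.0, "gpu": 16.5}      # GPU stage is running behind its limit
    print(compensate(quota, measured, {"cpu": cpu_limit, "gpu": gpu_limit}))
```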
{"title":"CHARM: Collaborative Host and Accelerator Resource Management for GPU Datacenters","authors":"Wei Zhang, Kaihua Fu, Ningxin Zheng, Quan Chen, Chao Li, Wenli Zheng, M. Guo","doi":"10.1109/ICCD53106.2021.00056","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00056","url":null,"abstract":"Emerging latency-critical (LC) services often have both CPU and GPU stages (e.g. DNN-assisted services) and require short response latency. Co-locating best-effort (BE) applications on the both CPU side and GPU side with the LC service improves resource utilization. However, resource contention often results in the QoS violation of LC services. We therefore present CHARM, a collaborative host-accelerator resource management system. CHARM ensures the required QoS target of DNN-assisted LC services, while maximizing the resource utilization of both the host and accelerator. CHARM is comprised of a BE-aware QoS target allocator, a unified heterogeneous resource manager, and a collaborative accelerator-side QoS compensator. The QoS target allocator determines the time limit of an LC service running on the host side and the accelerator side. The resource manager allocates the shared resources on both host side and accelerator side. The QoS compensator allocates more resources to the LC service to speed up its execution, if it runs slower than expected. Experimental results on an Nvidia GPU RTX 2080Ti show that CHARM improves the resource utilization by 43.2%, while ensuring the required QoS target compared with state-of-the-art solutions.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114906518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HASDH: A Hotspot-Aware and Scalable Dynamic Hashing for Hybrid DRAM-NVM Memory
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00034
Z. Li, Zhipeng Tan, Jianxi Chen
The Intel Optane DC Persistent Memory Module (DCPMM) is the first commercially available non-volatile memory (NVM) product and can be placed directly on the processor’s memory bus along with DRAM to serve as a hybrid memory. Compared with DRAM, NVM has about 3× the read latency and similar write latency, while its read and write bandwidths are only 1/3 and 1/6 of those of DRAM. However, existing hashing schemes fail to exploit these performance characteristics. We propose HASDH, a hotspot-aware and scalable dynamic hashing scheme built on hybrid DRAM-NVM memory. HASDH maintains structural metadata (i.e., the directory) in DRAM and persists key-value items in NVM. To reduce the access cost of hot key-value items, HASDH caches frequently accessed key-value items in DRAM with a dedicated caching strategy. To achieve scalable performance on multicore machines, HASDH maintains locks in DRAM, avoiding the extra NVM read-write bandwidth consumption caused by lock operations. Furthermore, HASDH chains all NVM segments using sibling pointers to their right neighbors to ensure crash consistency and leverages log-free NVM segment splits to reduce logging overhead. On an 18-core machine with Intel Optane DCPMM, experimental results show that HASDH achieves 1.43~7.39× speedup for insertions, 2.08~9.63× speedup for searches, and 1.78~3.01× speedup for deletions compared with state-of-the-art NVM-based hashing indexes.
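To make the hotspot-aware caching idea concrete, the sketch below keeps a key-value store (standing in for the NVM-resident segments) behind a small DRAM-side cache that promotes keys once they cross an access-count threshold. The promotion policy, capacity, and threshold are illustrative; the paper's directory, sibling pointers, locking, and crash-consistency machinery are not modeled.

```python
# Sketch of hotspot-aware caching in front of an NVM-resident store: keys that are
# read often enough get a copy in a small DRAM cache, so later reads skip NVM.
from collections import OrderedDict

class HotspotCachedStore:
    def __init__(self, cache_capacity: int = 4, hot_threshold: int = 3):
        self.nvm = {}                          # stand-in for NVM-resident segments
        self.dram_cache = OrderedDict()        # small LRU cache of hot items
        self.access_count = {}
        self.cache_capacity = cache_capacity
        self.hot_threshold = hot_threshold

    def put(self, key, value):
        self.nvm[key] = value                  # writes always persist to NVM
        if key in self.dram_cache:
            self.dram_cache[key] = value       # keep the cached copy coherent

    def get(self, key):
        if key in self.dram_cache:
            self.dram_cache.move_to_end(key)   # LRU bookkeeping
            return self.dram_cache[key]
        value = self.nvm.get(key)
        if value is None:
            return None
        self.access_count[key] = self.access_count.get(key, 0) + 1
        if self.access_count[key] >= self.hot_threshold:     # promote hot key to DRAM
            self.dram_cache[key] = value
            if len(self.dram_cache) > self.cache_capacity:
                self.dram_cache.popitem(last=False)           # evict the coldest entry
        return value

if __name__ == "__main__":
    store = HotspotCachedStore()
    store.put("hot", 1)
    store.put("cold", 2)
    for _ in range(5):
        store.get("hot")
    print("hot cached:", "hot" in store.dram_cache, "| cold cached:", "cold" in store.dram_cache)
```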
Low-Cost Sequential Logic Circuit Design Considering Single Event Double-Node Upsets and Single Event Transients
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00038
R. Rajaei, M. Niemier, X. Hu
As CMOS device sizes continue to scale down, radiation-related reliability issues are of ever-growing concern. Single event double-node upsets (SEDUs) in sequential logic and single event transients (SETs) in combinational logic are sources of high-rate radiation-induced soft errors that can affect the functionality of logic circuits. This paper presents effective circuit-level solutions for combating SEDUs/SETs in nanoscale sequential and combinational logic circuits. More specifically, we propose and evaluate low-power latch and flip-flop circuits that mitigate SEDUs and SETs. Simulations with a 22 nm PTM model reveal that the proposed circuits offer full immunity against SEDUs, filter SET pulses more effectively, and simultaneously reduce design overhead compared to prior work. As a representative example, simulation-based studies show that our designs offer up to 77% improvement in delay-power-area product and can filter out up to 58% wider SET pulses than the state-of-the-art.
{"title":"Low-Cost Sequential Logic Circuit Design Considering Single Event Double-Node Upsets and Single Event Transients","authors":"R. Rajaei, M. Niemier, X. Hu","doi":"10.1109/ICCD53106.2021.00038","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00038","url":null,"abstract":"As CMOS device sizes continue to scale down, radiation-related reliability issues are of ever-growing concern. Single event double node upsets (SEDUs) in sequential logic and single event transients (SETs) in combinational logic are sources of high rate radiation-induced soft errors that can affect the functionality of logic circuits. This paper presents effective circuit-level solutions for combating SEDUs/SETs in nanoscale sequential and combinational logic circuits. More specifically, we propose and evaluate low-power latch and flip-flop circuits to mitigate SEDUs and SETs. Simulations with a 22 nm PTM model reveal that the proposed circuits offer full immunity against SEDUs, can better filter SET pulses, and simultaneously reduce design overhead when compared to prior work. As a representative example, simulation-based studies show that our designs offer up to 77% improvements in delay-power-area product, and can filter out up to 58% wider SET pulses when compared to the state-of-the-art.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126481511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}