Special Session: How much quality is enough quality? A case for acceptability in approximate designs
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00013
Isaías B. Felzmann, João Fabrício Filho, Juliane Regina de Oliveira, L. Wanner
Approximate systems are designed to offer improved efficiency with potentially reduced quality of results. Quality of output in these systems is typically quantified in comparison to a precise result using metrics such as RMSE, MAE, PSNR, or application-specific metrics such as structural similarity of images (SSIM). Furthermore, systems are typically designed to maximize efficiency for a given minimum quality requirement. It is often difficult to determine what this quality requirement should be for an application, let alone a system. Thus, a fixed quality requirement may be overly conservative, and leave optimization opportunities on the table. In this work, we present a different approach to evaluate approximate systems based on the usefulness of results instead of quality. Our method qualitatively determines the acceptability of approximate results within different processing pipelines. To demonstrate the method, we implement three image and signal processing applications featuring scenarios of image classification, image recognition, and frequency estimation. Our results show that designing approximate systems to guarantee acceptability can produce up to 20% more valid results than the conservative quality thresholds commonly adopted in the literature, allowing for higher error rates and, consequently, lower energy cost.
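The contrast the abstract draws between a fixed quality threshold and task-level acceptability can be made concrete with a small sketch. This is an illustration, not the authors' implementation: the function names are hypothetical, and the classifier is a stand-in for whatever pipeline stage consumes the approximate output.

```c
#include <math.h>
#include <stddef.h>

/* A fixed quality threshold accepts an approximate output when its RMSE
 * versus the precise output is below a bound; task-level acceptability
 * instead asks whether the downstream consumer still produces a usable
 * result. */

double rmse(const double *precise, const double *approx, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = precise[i] - approx[i];
        acc += d * d;
    }
    return sqrt(acc / n);
}

/* Quality-based acceptance: conservative and application-agnostic. */
int accept_by_quality(const double *precise, const double *approx,
                      size_t n, double rmse_bound) {
    return rmse(precise, approx, n) <= rmse_bound;
}

/* Acceptability-based acceptance: the result is valid as long as the
 * pipeline's final decision is unchanged, even if the raw error is
 * above the conventional threshold. */
int accept_by_usefulness(int (*classify)(const double *, size_t),
                         const double *precise, const double *approx,
                         size_t n) {
    return classify(approx, n) == classify(precise, n);
}
```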
{"title":"Special Session: How much quality is enough quality? A case for acceptability in approximate designs","authors":"Isaías B. Felzmann, João Fabrício Filho, Juliane Regina de Oliveira, L. Wanner","doi":"10.1109/ICCD53106.2021.00013","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00013","url":null,"abstract":"Approximate systems are designed to offer improved efficiency with potentially reduced quality of results. Quality of output in these systems is typically quantified in comparison to a precise result using metrics such as RMSE, MAE, PSNR, or application-specific metrics such as structural similarity of images (SSIM). Furthermore, systems are typically designed to maximize efficiency for a given minimum quality requirement. It is often difficult to determine what this quality requirement should be for an application, let alone a system. Thus, a fixed quality requirement may be overly conservative, and leave optimization opportunities on the table. In this work, we present a different approach to evaluate approximate systems based on the usefulness of results instead of quality. Our method qualitatively determines the acceptability of approximate results within different processing pipelines. To demonstrate the method, we implement three image and signal processing applications featuring scenarios of image classification, image recognition, and frequency estimation. Our results show that designing approximate systems to guarantee acceptability can produce up to 20% more valid results than the conservative quality thresholds commonly adopted in the literature, allowing for higher error rates and, consequently, lower energy cost.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129586708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Methods for SoC Trust Validation Using Information Flow Verification
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00098
Khitam M. Alatoun, Shanmukha Murali Achyutha, R. Vemuri
Information flow properties are essential to identify security vulnerabilities in System-on-Chip (SoC) designs. Verifying information flow properties, such as integrity and confidentiality, is challenging as these properties cannot be handled using traditional assertion-based verification techniques. This paper proposes two novel approaches, a universal method and a property-driven method, to verify and monitor information flow properties. Both methods can be used for formal verification, dynamic verification during simulation, post-fabrication validation, and run-time monitoring. The universal method expedites implementing the information flow model and has less complexity than the most recently published technique. The property-driven method reduces the overhead of the security model, which helps speed up the verification process and create an efficient run-time hardware monitor. More than 20 information flow properties from 5 different designs were verified and several bugs were identified. We show that the method is scalable for large systems by applying it to an SoC design based on an OpenRISC-1200 processor.
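To give a flavor of what tracking an information flow property at the hardware level involves, here is a GLIFT-style shadow-logic illustration. This is a generic textbook construction under assumed types, not the paper's universal or property-driven method:

```c
#include <stdint.h>

/* Each signal carries a taint bit alongside its value; a
 * confidentiality property fails if taint originating at a secret
 * source ever reaches an observable output. */

typedef struct { uint8_t val; uint8_t taint; } sig_t;

/* AND gate with precise taint propagation: a tainted input can only
 * influence the output when the other input is 1 (or itself tainted). */
sig_t and_gate(sig_t a, sig_t b) {
    sig_t out;
    out.val   = a.val & b.val;
    out.taint = (a.taint & b.taint)
              | (a.taint & b.val)
              | (b.taint & a.val);
    return out;
}

/* A run-time monitor would assert that no tainted value reaches an
 * untrusted output port. */
int confidentiality_violated(sig_t observable_output) {
    return observable_output.taint != 0;
}
```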
{"title":"Efficient Methods for SoC Trust Validation Using Information Flow Verification","authors":"Khitam M. Alatoun, Shanmukha Murali Achyutha, R. Vemuri","doi":"10.1109/ICCD53106.2021.00098","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00098","url":null,"abstract":"Information flow properties are essential to identify security vulnerabilities in System-on-Chip (SoC) designs. Verifying information flow properties, such as integrity and confidentiality, is challenging as these properties cannot be handled using traditional assertion-based verification techniques. This paper proposes two novel approaches, a universal method and a property-driven method, to verify and monitor information flow properties. Both methods can be used for formal verification, dynamic verification during simulation, post-fabrication validation, and run-time monitoring. The universal method expedites implementing the information flow model and has less complexity than the most recently published technique. The property-driven method reduces the overhead of the security model, which helps speed up the verification process and create an efficient run-time hardware monitor. More than 20 information flow properties from 5 different designs were verified and several bugs were identified. We show that the method is scalable for large systems by applying it to an SoC design based on an OpenRISC-1200 processor.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130280750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Differential Testing of x86 Instruction Decoders with Instruction Operand Inferring Algorithm
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00040
Guang Wang, Ziyuan Zhu, Shuan Li, Xu Cheng, Dan Meng
Instruction decoders are tools for software analysis, sandboxing, malware detection, and the detection of undocumented instructions. The decoders must be accurate and consistent with the instruction set architecture manuals. Existing testing methods for instruction decoders are based on random and instruction-structure mutation, and they are mainly aimed at the legal instruction space. However, there is little research on whether instructions in the reserved instruction space can be accurately identified as invalid. We propose an instruction operand inferring algorithm, based on depth-first search, to skip considerable redundant legal instruction space. The algorithm keeps the types of instructions in the legal instruction space unchanged and guarantees the traversal of the reserved instruction space. In addition, we propose a differential testing method that discovers decoding discrepancies between instruction decoders. We applied the method to XED and Capstone and found four million inconsistent instructions between them. Compared with the existing instruction generation method based on depth-first search, the efficiency of our method is improved by about four times.
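A differential-testing skeleton for this setting might look like the sketch below. The Capstone side uses the library's real C API; `reference_decode` is a hypothetical stub standing in for the second decoder (an XED-based routine in the paper's setup), not code from the paper.

```c
#include <capstone/capstone.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the second decoder; returns the decoded length in
 * bytes, or 0 if the byte sequence is invalid. Plug in the real
 * reference decoder here. */
static size_t reference_decode(const uint8_t *code, size_t size) {
    (void)code; (void)size;
    return 0;
}

/* Decode the same bytes with both decoders and report when they
 * disagree on validity or on instruction length. */
void diff_test(const uint8_t *code, size_t size) {
    csh handle;
    if (cs_open(CS_ARCH_X86, CS_MODE_64, &handle) != CS_ERR_OK)
        return;

    cs_insn *insn;
    size_t cs_count = cs_disasm(handle, code, size, 0x1000, 1, &insn);
    size_t cs_len = cs_count ? insn[0].size : 0;
    size_t ref_len = reference_decode(code, size);

    if ((cs_len == 0) != (ref_len == 0) || cs_len != ref_len)
        printf("discrepancy at %02x %02x ...: capstone=%zu ref=%zu\n",
               code[0], size > 1 ? code[1] : 0, cs_len, ref_len);

    if (cs_count)
        cs_free(insn, cs_count);
    cs_close(&handle);
}
```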
{"title":"Differential Testing of x86 Instruction Decoders with Instruction Operand Inferring Algorithm","authors":"Guang Wang, Ziyuan Zhu, Shuan Li, Xu Cheng, Dan Meng","doi":"10.1109/ICCD53106.2021.00040","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00040","url":null,"abstract":"The instruction decoders are tools for software analysis, sandboxing, malware detection, and undocumented instructions detection. The decoders must be accurate and consistent with the instruction set architecture manuals. The existing testing methods for instruction decoders are based on random and instruction structure mutation. Moreover, the methods are mainly aimed at the legal instruction space. However, there is little research on whether the instructions in the reserved instruction space can be accurately identified as invalid instructions. We propose an instruction operand inferring algorithm, based on the depth-first search algorithm, to skip considerable redundant legal instruction space. The algorithm keeps the types of instructions in the legal instruction space unchanged and guarantees the traversal of the reserved instruction space. In addition, we propose a differential testing method that discovers decoding discrepancies between instruction decoders. We applied the method to XED and Capstone and found four million inconsistent instructions between them. Compared with the existing instruction generation method based on the depth-first search algorithm, the efficiency of our method is improved by about four times.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128724085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chopin: Composing Cost-Effective Custom Chips with Algorithmic Chiplets
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00069
Pete Ehrett, Todd M. Austin, V. Bertacco
As computational demands rise, the need for specialized hardware has grown acute. However, the immense cost of fully-custom chips has forced many developers to rely on suboptimal solutions like FPGAs, especially for low- to mid-volume applications, in which multi-million-dollar non-recurring engineering (NRE) costs cannot be amortized effectively. We propose to address this problem by composing custom chips out of small, algorithmic chiplets, reusable across diverse designs, such that high NRE costs may be amortized across many different designs. This work models the economics of this paradigm and identifies a cost-optimal granularity for algorithmic chiplets, then demonstrates how those guidelines may be applied to design high-performance, algorithmically-composable hardware components – which may be reused, without modification, across many different processing pipelines. For an example phased-array radar accelerator, our chiplet-centric paradigm improves perf-per-$ by 9.3× over an FPGA, and ∼4× over a conventional ASIC.
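The amortization argument reduces to simple arithmetic, sketched below with assumed numbers (they are illustrative, not the paper's data): a reusable chiplet spreads its NRE over every design that instantiates it, while a monolithic ASIC pays its NRE over a single product's volume.

```c
#include <stdio.h>

/* Per-unit cost = amortized NRE + recurring unit cost. */
double unit_cost(double nre, double per_unit, long units) {
    return nre / units + per_unit;
}

int main(void) {
    /* Monolithic ASIC: $20M NRE amortized over one 50k-unit product. */
    double asic = unit_cost(20e6, 50.0, 50000);        /* = $450/unit */

    /* Chiplet: $4M NRE, slightly higher recurring cost, but reused
     * across 10 different designs of 50k units each. */
    double chiplet = unit_cost(4e6, 60.0, 10 * 50000); /* = $68/unit  */

    printf("ASIC $%.0f/unit vs chiplet $%.0f/unit\n", asic, chiplet);
    return 0;
}
```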
{"title":"Chopin: Composing Cost-Effective Custom Chips with Algorithmic Chiplets","authors":"Pete Ehrett, Todd M. Austin, V. Bertacco","doi":"10.1109/ICCD53106.2021.00069","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00069","url":null,"abstract":"As computational demands rise, the need for specialized hardware has grown acute. However, the immense cost of fully-custom chips has forced many developers to rely on suboptimal solutions like FPGAs, especially for low- to mid-volume applications, in which multi-million-dollar non-recurring engineering (NRE) costs cannot be amortized effectively. We propose to address this problem by composing custom chips out of small, algorithmic chiplets, reusable across diverse designs, such that high NRE costs may be amortized across many different designs. This work models the economics of this paradigm and identifies a cost-optimal granularity for algorithmic chiplets, then demonstrates how those guidelines may be applied to design high-performance, algorithmically-composable hardware components – which may be reused, without modification, across many different processing pipelines. For an example phased-array radar accelerator, our chiplet-centric paradigm improves perf-per-$ by 9.3× over an FPGA, and ∼4× over a conventional ASIC.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123514086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Write-Optimized and Consistent RDMA-based Non-Volatile Main Memory Systems
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00048
Xinxin Liu, Yu Hua, Xuan Li, Qifan Liu
To deliver high performance in cloud computing, many efforts leverage RDMA (Remote Direct Memory Access) in networking and NVMM (Non-Volatile Main Memory) in end systems. Because it involves no remote CPU, one-sided RDMA offers efficient access to remote memory, and NVMM technologies have the strengths of non-volatility, byte-addressability, and DRAM-like latency. However, to guarantee Remote Data Atomicity (RDA), the combined scheme has to pay extra network round-trips, remote CPU participation, and double NVMM writes. To address these problems, we propose Erda, a write-optimized, log-structured NVMM design for Efficient Remote Data Atomicity. In Erda, clients directly transfer data to the destination memory addresses in the logs on servers via one-sided RDMA writes, without redundant copies or remote CPU consumption. To detect whether fetched data is atomic, we verify a checksum without client-server coordination. We further ensure metadata consistency by leveraging an 8-byte atomic update in a hash table, which also stores the addresses of previous versions of data in the log. When a failure occurs, the server efficiently restores a consistent state. Experimental results show that, compared with state-of-the-art schemes, Erda reduces NVMM writes by approximately 50%, significantly improves throughput, and decreases latency.
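The two consistency mechanisms described above can be sketched as follows. The layout and hash function are assumptions for illustration, not Erda's actual on-NVM format: a checksummed log entry lets a reader detect a torn or incomplete RDMA write without involving the server CPU, and an aligned 8-byte atomic store publishes the new version.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical log-entry layout: the checksum covers the header fields
 * before it plus the payload. */
typedef struct {
    uint64_t key;
    uint64_t len;
    uint64_t prev;       /* NVM offset of the previous version */
    uint64_t checksum;
    uint8_t  payload[];
} log_entry_t;

static uint64_t fnv1a(const void *buf, size_t n, uint64_t h) {
    const uint8_t *p = buf;
    for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 0x100000001b3ULL; }
    return h;
}

/* Reader-side atomicity check: a partially written entry fails this,
 * so no client-server coordination is needed. */
int entry_is_atomic(const log_entry_t *e) {
    uint64_t h = fnv1a(e, offsetof(log_entry_t, checksum),
                       0xcbf29ce484222325ULL);
    h = fnv1a(e->payload, e->len, h);
    return h == e->checksum;
}

/* 8-byte atomic publish: the bucket slot flips to the new entry's NVM
 * offset in one aligned store, so readers see old or new, never a mix. */
void publish(_Atomic uint64_t *bucket_slot, uint64_t new_entry_off) {
    atomic_store(bucket_slot, new_entry_off);
}
```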
{"title":"Write-Optimized and Consistent RDMA-based Non-Volatile Main Memory Systems","authors":"Xinxin Liu, Yu Hua, Xuan Li, Qifan Liu","doi":"10.1109/ICCD53106.2021.00048","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00048","url":null,"abstract":"To deliver high performance in cloud computing, many efforts leverage RDMA (Remote Direct Memory Access) in networking and NVMM (Non-Volatile Main Memory) in end systems. Due to no CPU involvement, one-sided RDMA becomes efficient to access the remote memory, and NVMM technologies have the strengths of non-volatility, byte-addressability and DRAM-like latency. However, due to the need to guarantee Remote Data Atomicity (RDA), the synergized scheme has to consume extra network round-trips, remote CPU participation and double NVMM writes. In order to address these problems, we propose a write-optimized log-structured NVMM design for Efficient Remote Data Atomicity, called Erda. In Erda, clients directly transfer data to the destination memory addresses in the logs on servers via one-sided RDMA writes without redundant copies and remote CPU consumption. To detect the atomicity of the fetched data, we verify a checksum without client-server coordination. We further ensure metadata consistency by leveraging an 8-byte atomic update in a hash table, which also contains the addresses of previous versions of data in the log for consistency. When a failure occurs, the server properly and efficiently restores to become consistent. Experimental results show that compared with state-of-the-art schemes, Erda reduces NVMM writes approximately by 50%, significantly improves throughput and decreases latency.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122456717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conciliating Speed and Efficiency on Cache Compressors
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00075
Daniel Rodrigues Carvalho, André Seznec
Cache compression algorithms must abide by hardware constraints; thus, their efficiency ends up being low, and most cache lines end up barely compressed. Moreover, schemes that compress relatively well often decompress slowly, and vice versa. This paper proposes a compression scheme that achieves a high compaction ratio and fast decompression. The key observation is that by further subdividing the chunks of data being compressed, one can tailor the algorithms. This concept is orthogonal to most existing compressors and reduces their average compressed size. In particular, we leverage this concept to boost a single-cycle-decompression compressor to a compressibility level competitive with state-of-the-art proposals. When normalized against the best long-decompression-latency state-of-the-art compressors, the proposed ideas further enhance the average cache capacity by 2.7% (geometric mean), while featuring short decompression latency.
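A toy illustration of the subdivision idea (the encoding table is an assumption, not the paper's compressor): instead of classifying a whole 8-byte word, classify each 4-byte sub-chunk separately, so a line mixing zero halves and small values still compresses even when no full word is zero or narrow.

```c
#include <stddef.h>
#include <stdint.h>

/* Encoded size in bytes of one 32-bit sub-chunk; the 2-bit class tag
 * is accounted for separately. */
static size_t encode_subchunk(uint32_t v) {
    if (v == 0)      return 0;  /* zero: tag only            */
    if (v <= 0xFF)   return 1;  /* narrow: 1 data byte       */
    if (v <= 0xFFFF) return 2;  /* half-width: 2 data bytes  */
    return 4;                   /* uncompressed              */
}

/* Compressed size of a 64-byte line split into 16 sub-chunks
 * (tag bits rounded up to whole bytes). */
size_t compressed_size(const uint32_t line[16]) {
    size_t tag_bits = 16 * 2, data_bytes = 0;
    for (int i = 0; i < 16; i++)
        data_bytes += encode_subchunk(line[i]);
    return (tag_bits + 7) / 8 + data_bytes;
}
```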
{"title":"Conciliating Speed and Efficiency on Cache Compressors","authors":"Daniel Rodrigues Carvalho, André Seznec","doi":"10.1109/ICCD53106.2021.00075","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00075","url":null,"abstract":"Cache compression algorithms must abide by hard-ware constraints; thus, their efficiency ends up being low, and most cache lines end up barely compressed. Moreover, schemes that compress relatively well often decompress slowly, and vice versa. This paper proposes a compression scheme achieving high (good) compaction ratio and fast decompression latency. The key observation is that by further subdividing the chunks of data being compressed one can tailor the algorithms. This concept is orthogonal to most existent compressors, and results in a reduction of their average compressed size. In particular, we leverage this concept to boost a single-cycle-decompression compressor to reach a compressibility level competitive to state-of-the-art proposals. When normalized against the best long decompression latency state-of-the-art compressors, the proposed ideas further enhance the average cache capacity by 2.7% (geometric mean), while featuring short decompression latency.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127799832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimal Transistor Placement Combined with Global In-cell Routing in Standard Cell Layout Synthesis
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00085
Kyeongrok Jo, Taewhan Kim
The synthesis of standard cell layouts is largely divided into two tasks, namely transistor placement and in-cell routing. Since the result of transistor placement strongly affects the quality of in-cell routing, it is crucial to accurately and efficiently predict in-cell routability during transistor placement. In this work, we address the problem of optimal transistor placement combined with global in-cell routing, with the primary objective of minimizing cell size and the secondary objective of minimizing wirelength for global in-cell routing. To this end, unlike the conventional indirect and complex SMT (satisfiability modulo theories) formulation, we propose a direct and efficient SMT formulation of the original problem. Experiments confirm that our method produces minimal-area cell layouts with minimal wirelength for global in-cell routing while spending much less running time than the conventional optimal layout generator.
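A generic flavor of such a formulation, as a hedged illustration rather than the paper's exact constraint set, assigns each transistor $i$ an integer column $x_i$ and encodes the two objectives lexicographically:

```latex
\begin{align*}
& \text{minimize } W && \text{(cell width, primary objective)} \\
& \text{then minimize } \textstyle\sum_{n \in \mathit{nets}} \mathrm{HPWL}(n)
  && \text{(global in-cell wirelength, secondary)} \\
& \text{subject to } 1 \le x_i \le W && \text{every transistor fits in the cell} \\
& \quad\;\; x_i \ne x_j && \text{for } i \ne j \text{ in the same row (no overlap)} \\
& \quad\;\; |x_i - x_j| = 1 \Rightarrow \mathrm{shared}(i,j)
  && \text{abutting transistors must share diffusion}
\end{align*}
```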
{"title":"Optimal Transistor Placement Combined with Global In-cell Routing in Standard Cell Layout Synthesis","authors":"Kyeongrok Jo, Taewhan Kim","doi":"10.1109/ICCD53106.2021.00085","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00085","url":null,"abstract":"The synthesis of standard cell layouts is largely divided into two tasks namely transistor placement and in-cell routing. Since the result of transistor placement highly affects the quality of in-cell routing, it is crucial to accurately and efficiently predict in-cell routability during transistor placement. In this work, we address the problem of an optimal transistor placement combined with global in-cell routing with the primary objective of minimizing cell size and the secondary objective of minimizing wirelength for global in-cell routing. To this end, unlike the conventional indirect and complex SMT (satisfiability modulo theory) formulation, we propose a method of direct and efficient formulation of the original problem based on SMT. Through experiments, it is confirmed that our proposed method is able to produce minimal-area cell layouts with minimal wirelength for global in-cell routing while spending much less running time over the conventional optimal layout generator.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115726047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CHARM: Collaborative Host and Accelerator Resource Management for GPU Datacenters
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00056
Wei Zhang, Kaihua Fu, Ningxin Zheng, Quan Chen, Chao Li, Wenli Zheng, M. Guo
Emerging latency-critical (LC) services often have both CPU and GPU stages (e.g., DNN-assisted services) and require short response latency. Co-locating best-effort (BE) applications with the LC service on both the CPU and GPU sides improves resource utilization. However, resource contention often results in QoS violations of LC services. We therefore present CHARM, a collaborative host-accelerator resource management system. CHARM ensures the required QoS target of DNN-assisted LC services while maximizing the resource utilization of both the host and the accelerator. CHARM comprises a BE-aware QoS target allocator, a unified heterogeneous resource manager, and a collaborative accelerator-side QoS compensator. The QoS target allocator determines the time limit of an LC service running on the host side and the accelerator side. The resource manager allocates the shared resources on both sides. The QoS compensator allocates more resources to the LC service to speed up its execution if it runs slower than expected. Experimental results on an Nvidia RTX 2080 Ti GPU show that CHARM improves resource utilization by 43.2% while ensuring the required QoS target, compared with state-of-the-art solutions.
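The allocator/compensator interplay might look like the following sketch. The names and the proportional policy are assumptions for illustration, not CHARM's published algorithm:

```c
/* Split the end-to-end QoS target between the CPU and GPU stages, then
 * grant the LC service more resource quota whenever a stage overruns
 * its share of the budget. */

typedef struct {
    double qos_target_ms;   /* end-to-end latency target            */
    double cpu_budget_ms;   /* time limit for the host-side stage   */
    double gpu_budget_ms;   /* time limit for the accelerator stage */
} budget_t;

/* Weight each stage's budget by its measured solo latency. */
budget_t split_budget(double qos_ms, double cpu_solo_ms, double gpu_solo_ms) {
    double total = cpu_solo_ms + gpu_solo_ms;
    budget_t b = { qos_ms,
                   qos_ms * cpu_solo_ms / total,
                   qos_ms * gpu_solo_ms / total };
    return b;
}

/* Compensator: if the measured stage latency exceeds its budget, take
 * quota (cores, SM partitions, ...) back from BE tasks, proportionally
 * to the overrun. */
int extra_quota_needed(double measured_ms, double budget_ms, int cur_quota) {
    if (measured_ms <= budget_ms) return 0;
    double overrun = measured_ms / budget_ms;      /* 1.3 = 30% too slow */
    return (int)(cur_quota * (overrun - 1.0) + 1);
}
```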
{"title":"CHARM: Collaborative Host and Accelerator Resource Management for GPU Datacenters","authors":"Wei Zhang, Kaihua Fu, Ningxin Zheng, Quan Chen, Chao Li, Wenli Zheng, M. Guo","doi":"10.1109/ICCD53106.2021.00056","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00056","url":null,"abstract":"Emerging latency-critical (LC) services often have both CPU and GPU stages (e.g. DNN-assisted services) and require short response latency. Co-locating best-effort (BE) applications on the both CPU side and GPU side with the LC service improves resource utilization. However, resource contention often results in the QoS violation of LC services. We therefore present CHARM, a collaborative host-accelerator resource management system. CHARM ensures the required QoS target of DNN-assisted LC services, while maximizing the resource utilization of both the host and accelerator. CHARM is comprised of a BE-aware QoS target allocator, a unified heterogeneous resource manager, and a collaborative accelerator-side QoS compensator. The QoS target allocator determines the time limit of an LC service running on the host side and the accelerator side. The resource manager allocates the shared resources on both host side and accelerator side. The QoS compensator allocates more resources to the LC service to speed up its execution, if it runs slower than expected. Experimental results on an Nvidia GPU RTX 2080Ti show that CHARM improves the resource utilization by 43.2%, while ensuring the required QoS target compared with state-of-the-art solutions.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114906518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HASDH: A Hotspot-Aware and Scalable Dynamic Hashing for Hybrid DRAM-NVM Memory
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00034
Z. Li, Zhipeng Tan, Jianxi Chen
Intel Optane DC Persistent Memory Module (DCPMM) is the first commercially available non-volatile memory (NVM) product and can be placed directly on the processor's memory bus along with DRAM to serve as a hybrid memory. Compared with DRAM, NVM has roughly 3× the read latency and similar write latency, while its read and write bandwidths are only one-third and one-sixth of those of DRAM, respectively. However, existing hashing schemes fail to account for these performance characteristics. We propose HASDH, a hotspot-aware and scalable dynamic hashing scheme built on hybrid DRAM-NVM memory. HASDH maintains structure metadata (i.e., the directory) in DRAM and persists key-value items in NVM. To reduce the access cost of hot key-value items, HASDH caches frequently accessed key-value items in DRAM with a dedicated caching strategy. To achieve scalable performance on multicore machines, HASDH maintains locks in DRAM, avoiding the extra NVM read-write bandwidth that lock operations would otherwise consume. Furthermore, HASDH chains all NVM segments using sibling pointers to their right neighbors to ensure crash consistency, and leverages log-free NVM segment splits to reduce logging overhead. On an 18-core machine with Intel Optane DCPMM, experimental results show that HASDH achieves 1.43~7.39× speedup for insertions, 2.08~9.63× speedup for searches, and 1.78~3.01× speedup for deletions, compared with state-of-the-art NVM-based hashing indexes.
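Structurally, the DRAM-directory/NVM-segment split and the log-free split publish could be sketched as below. The types and sizes are assumptions, not HASDH's actual layout:

```c
#include <stdatomic.h>
#include <stdint.h>

#define SEG_SLOTS 256

/* Segments hold the persisted key-value items in NVM; the sibling
 * pointer chains each segment to its right neighbor for recovery. */
typedef struct segment {
    uint64_t keys[SEG_SLOTS];
    uint64_t vals[SEG_SLOTS];
    struct segment *sibling;
} segment_t;

/* The directory is volatile DRAM metadata: an array of pointers to
 * NVM-resident segments. */
typedef struct {
    _Atomic(segment_t *) slot[1024];
} directory_t;

/* Log-free split: build the new segment in NVM, persist it fully, then
 * swing the directory entry with one 8-byte atomic store. Readers
 * racing with the split see either the old or the new segment, both
 * self-consistent, so no undo/redo logging is needed. */
void publish_split(directory_t *dir, unsigned idx, segment_t *new_seg) {
    /* ... new_seg fully written and flushed to NVM at this point ... */
    atomic_store(&dir->slot[idx], new_seg);
}
```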
{"title":"HASDH: A Hotspot-Aware and Scalable Dynamic Hashing for Hybrid DRAM-NVM Memory","authors":"Z. Li, Zhipeng Tan, Jianxi Chen","doi":"10.1109/ICCD53106.2021.00034","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00034","url":null,"abstract":"Intel Optane DC Persistent Memory Module (DCPMM) is the first commercially available non-volatile memory (NVM) product and can be directly placed on the processor’s memory bus along with DRAM to serve as a hybrid memory. Compared with DRAM, NVM has 3× read latency and similar write latency, while the read and write bandwidths of NVM are only 1/3rd and 1/6th of those of DRAM. However, existing hashing schemes fail to reap those performance characteristics. We propose HASDH, a hotspot-aware and scalable dynamic hashing built on the hybrid DRAM-NVM memory. HASDH maintains structure metadata (i.e., directory) in DRAM and persists key-value items in NVM. To reduce hot key-value items’ access cost, HASDH caches frequently-accessed key-value items in DRAM with a dedicated caching strategy. To achieve scalable performance for multicore machines, HASDH maintains locks in DRAM that avoid the extra NVM read-write bandwidth consumption caused by lock operations. Furthermore, HASDH chains all NVM segments using sibling pointers to the right neighbors to ensure crash consistency and leverages log-free NVM segment split to reduce logging overhead. On an 18-core machine with Intel Optane DCPMM, experimental results show that HASDH achieves 1.43∼7.39× speedup for insertions, 2.08~9.63× speedup for searches, and 1.78~3.01× speedup for deletions, compared with start-of-the-art NVM-based hashing indexes.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122179364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low-Cost Sequential Logic Circuit Design Considering Single Event Double-Node Upsets and Single Event Transients
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00038
R. Rajaei, M. Niemier, X. Hu
As CMOS device sizes continue to scale down, radiation-related reliability issues are of ever-growing concern. Single event double-node upsets (SEDUs) in sequential logic and single event transients (SETs) in combinational logic are sources of high-rate radiation-induced soft errors that can affect the functionality of logic circuits. This paper presents effective circuit-level solutions for combating SEDUs and SETs in nanoscale sequential and combinational logic circuits. More specifically, we propose and evaluate low-power latch and flip-flop circuits that mitigate SEDUs and SETs. Simulations with a 22 nm PTM model show that the proposed circuits offer full immunity against SEDUs, filter SET pulses better, and simultaneously reduce design overhead compared with prior work. As a representative example, simulation-based studies show that our designs offer up to 77% improvement in delay-power-area product and can filter out up to 58% wider SET pulses than the state-of-the-art.
{"title":"Low-Cost Sequential Logic Circuit Design Considering Single Event Double-Node Upsets and Single Event Transients","authors":"R. Rajaei, M. Niemier, X. Hu","doi":"10.1109/ICCD53106.2021.00038","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00038","url":null,"abstract":"As CMOS device sizes continue to scale down, radiation-related reliability issues are of ever-growing concern. Single event double node upsets (SEDUs) in sequential logic and single event transients (SETs) in combinational logic are sources of high rate radiation-induced soft errors that can affect the functionality of logic circuits. This paper presents effective circuit-level solutions for combating SEDUs/SETs in nanoscale sequential and combinational logic circuits. More specifically, we propose and evaluate low-power latch and flip-flop circuits to mitigate SEDUs and SETs. Simulations with a 22 nm PTM model reveal that the proposed circuits offer full immunity against SEDUs, can better filter SET pulses, and simultaneously reduce design overhead when compared to prior work. As a representative example, simulation-based studies show that our designs offer up to 77% improvements in delay-power-area product, and can filter out up to 58% wider SET pulses when compared to the state-of-the-art.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126481511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}