
IEEE Computer Architecture Letters: Latest Articles

X-ray: Discovering DRAM Internal Structure and Error Characteristics by Issuing Memory Commands
IF 2.3 Zone 3 Computer Science Pub Date: 2023-07-17 DOI: 10.1109/LCA.2023.3296153
Hwayong Nam;Seungmin Baek;Minbok Wi;Michael Jaemin Kim;Jaehyun Park;Chihun Song;Nam Sung Kim;Jung Ho Ahn
The demand for accurate information about the internal structure and characteristics of DRAM has been on the rise. Recent studies have explored the structure and characteristics of DRAM to improve processing in memory, enhance reliability, and mitigate a vulnerability known as rowhammer. However, DRAM manufacturers only disclose limited information through official documents, making it difficult to find specific information about actual DRAM devices. This paper presents reliable findings on the internal structure and characteristics of DRAM using activate-induced bitflips (AIBs), retention time test, and row-copy operation. While previous studies have attempted to understand the internal behaviors of DRAM devices, they have only shown results without identifying the causes or have analyzed DRAM modules rather than individual chips. We first uncover the size, structure, and operation of DRAM subarrays and verify our findings on the characteristics of DRAM. Then, we correct misunderstood information related to AIBs and demonstrate experimental results supporting the cause of rowhammer.
Citations: 1
LADIO: Leakage-Aware Direct I/O for I/O-Intensive Workloads
IF 2.3 Zone 3 Computer Science Pub Date: 2023-07-03 DOI: 10.1109/LCA.2023.3290427
Ipoom Jeong;Jiaqi Lou;Yongseok Son;Yongjoo Park;Yifan Yuan;Nam Sung Kim
The advancement in I/O technology has posed an unprecedented demand for high-performance processing on I/O data, leading to the development of Data Direct I/O (DDIO) technology. DDIO improves I/O processing efficiency by directly injecting all inbound I/O data into the last-level cache (LLC) in cooperation with any type of I/O device. Nonetheless, in certain scenarios with more than one I/O application, DDIO may perform sub-optimally because of interference inside the LLC, degrading system performance. In particular, we demonstrate in this paper that storage I/O on modern high-performance NVMe SSDs hardly benefits from DDIO and sometimes causes inefficient use of the shared LLC due to the “leaky DMA problem”. To address this problem, we propose LADIO, an adaptive approach that mitigates inter-application interference by dynamically controlling the DDIO functionality and reallocating LLC ways based on the leakage and locality of storage I/O data, respectively. In scenarios with heavy I/O interference, LADIO improves the throughput of network-intensive applications by 20% while maintaining that of storage-intensive applications.
Citations: 0
T-CAT: Dynamic Cache Allocation for Tiered Memory Systems With Memory Interleaving
IF 2.3 Zone 3 Computer Science Pub Date: 2023-06-28 DOI: 10.1109/LCA.2023.3290197
Hwanjun Lee;Seunghak Lee;Yeji Jung;Daehoon Kim
New memory interconnect technology, such as Intel's Compute Express Link (CXL), helps to expand memory bandwidth and capacity by adding CPU-less NUMA nodes to the main memory system, addressing the growing memory wall challenge. Consequently, modern computing systems embrace heterogeneity in memory, building tiered memory systems from near and far memory (e.g., local DRAM and CXL-DRAM). However, applying NUMA interleaving, which can improve performance by exploiting node-level parallelism and utilizing aggregate bandwidth, to tiered memory systems is challenging because the two types of memory differ in access latency, which can degrade performance for memory-intensive workloads. To tackle these challenges, we first investigate the effects of NUMA interleaving on the performance of tiered memory systems. We observe that while NUMA interleaving is essential for applications demanding high memory bandwidth, it can negatively impact the performance of applications demanding low memory bandwidth. Next, we propose a dynamic cache management scheme, called T-CAT, which partitions the last-level cache between near and far memory, aiming to mitigate the performance degradation caused by accessing far memory. T-CAT attempts to reduce the difference in the average access latency between near and far memory by resizing the cache partitions. Through dynamic cache management, T-CAT can preserve the performance benefits of NUMA interleaving while mitigating the performance degradation caused by far memory accesses. Our experimental results show that T-CAT improves performance by up to 17% compared to NUMA interleaving without cache management.
Citations: 0
Hardware Accelerated Reusable Merkle Tree Generation for Bitcoin Blockchain Headers
IF 2.3 Zone 3 Computer Science Pub Date: 2023-06-28 DOI: 10.1109/LCA.2023.3289515
Kiseok Jeon;Junghee Lee;Bumsoo Kim;James J. Kim
As the value of Bitcoin increases, the difficulty of mining keeps increasing. This is generally addressed with application-specific integrated circuits (ASICs), but block candidates are still created in software. The relative overhead of block candidate generation is growing because the hash computation is accelerated by ASICs. Additionally, it is getting harder to find the target nonce; if it is not found for a block candidate, a new block candidate must be generated. To reduce the overhead of block candidate generation, a new candidate can be generated by modifying the coinbase without selecting and verifying transactions again. To this end, we propose a hardware accelerator for generating Merkle trees efficiently. The hash computation for Merkle tree generation is performed on an ASIC to reduce the overhead of block candidate generation, and a tree in which only the coinbase has been modified is rapidly regenerated by reusing the intermediate results of the previously generated tree. Our simulation results demonstrate that the execution time can be reduced by up to 98.92% and power consumption by up to 99.73% when the number of transactions in a tree is 2048.
Citations: 0
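The reuse idea in the Merkle tree entry above can be illustrated in software. The following is a minimal Python sketch, not the authors' hardware design: it assumes Bitcoin-style double SHA-256 and the usual rule of duplicating the last hash on odd-sized levels, and the function names (`dsha256`, `build_levels`, `update_coinbase`) are hypothetical. When only the coinbase (leaf 0) changes, only the leftmost path to the root needs rehashing, and every sibling hash is taken from the previously built tree.

```python
import hashlib


def dsha256(data: bytes) -> bytes:
    """Bitcoin-style double SHA-256."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def build_levels(leaves):
    """Build the whole tree; returns all levels, leaves first, root last."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            # Odd level: pair the last hash with itself.
            cur = cur + [cur[-1]]
        levels.append(
            [dsha256(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)]
        )
    return levels


def update_coinbase(levels, new_leaf):
    """Replace leaf 0 and rehash only the leftmost path, in place.

    Every level below the root holds at least two hashes, so the
    sibling of node 0 is always node 1 of the stored level.
    """
    levels[0][0] = new_leaf
    for depth in range(len(levels) - 1):
        cur = levels[depth]
        levels[depth + 1][0] = dsha256(cur[0] + cur[1])
    return levels[-1][0]
```

With 2048 leaves, `update_coinbase` performs 11 pair hashes, whereas a full rebuild with `build_levels` performs 2047. This only counts hash operations in the sketch and says nothing about the speedups reported in the abstract.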
Containerized In-Storage Processing Model and Hardware Acceleration for Fully-Flexible Computational SSDs
IF 2.3 Zone 3 Computer Science Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2023-06-27 DOI: 10.1109/lca.2023.3289828
Donghyun Gouk, Miryeong Kwon, Hanyeoreum Bae, Myoungsoo Jung
In-storage processing (ISP) efficiently examines large datasets but faces performance and security challenges. We introduce DockerSSD, a flexible ISP model that runs various applications near flash without modification. It employs lightweight OS-level virtualization in modern SSDs for faster ISP and better storage intelligence with high flexibility. DockerSSD reuses existing Docker container images for real-time data processing without altering the storage interface or runtime. Our design includes a new communication method and virtual firmware, alongside automated container-related network and I/O handling hardware. DockerSSD achieves a 2× speed improvement and reduces system-level power by 35.7%, on average.
Citations: 0
Guard Cache: Creating Noisy Side-Channels
IF 2.3 Zone 3 Computer Science Pub Date: 2023-06-27 DOI: 10.1109/LCA.2023.3289710
Fernando Mosquera;Krishna Kavi;Gayatri Mehta;Lizy John
Microarchitectural innovations such as deep cache hierarchies, out-of-order execution, branch prediction and speculative execution have made possible the design of processors that meet ever-increasing demands for performance. However, these innovations have inadvertently introduced vulnerabilities, which are exploited by side-channel attacks and attacks relying on speculative execution. Mitigating the attacks while preserving performance has been a challenge. In this letter, we present an approach to obfuscate cache timing, making it more difficult for side-channel attacks to succeed. We create false cache hits using a small Guard Cache with randomization, and false cache misses by randomly evicting cache lines. We show that our false hits and false misses cause minimal performance penalties and that our obfuscation can make it difficult for common side-channel attacks such as Prime+Probe, Flush+Reload, or Evict+Time to succeed.
Citations: 0
TURBULENCE: Complexity-Effective Out-of-Order Execution on GPU With Distance-Based ISA
IF 1.4 Zone 3 Computer Science Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2023-06-26 DOI: 10.1109/LCA.2023.3289317
Reoma Matsuo;Toru Koizumi;Hidetsugu Irie;Shuichi Sakai;Ryota Shioya
A graphics processing unit (GPU) is a processor that achieves high throughput by exploiting data parallelism. We found that many GPU workloads also contain instruction-level parallelism that can be extracted through out-of-order execution to provide additional performance improvement opportunities. We propose the TURBULENCE architecture for very low-cost out-of-order execution on GPUs. TURBULENCE consists of a novel ISA that introduces the concept of referencing operands by inter-instruction distance instead of register numbers, and a novel microarchitecture that executes the novel ISA. This distance-based operand has the property of not causing false dependencies. By exploiting this property, we achieve cost-effective out-of-order execution on GPUs without introducing expensive hardware such as a rename logic and a load-store queue. Simulation results show that TURBULENCE improves performance by 17.6% without increasing energy consumption over an existing GPU.
Citations: 0
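To make the notion of referencing operands by inter-instruction distance in the TURBULENCE entry above more concrete, here is a toy Python sketch. It is only an illustration of the general concept under simple assumptions (straight-line code, one result per instruction); the actual TURBULENCE encoding is not described in the abstract, and the tuple format and the name `to_distance_form` are hypothetical. Each source register is replaced by how many instructions back its producer sits, so destination register names disappear and register reuse can no longer create write-after-read or write-after-write dependencies.

```python
def to_distance_form(program):
    """Rewrite register-based instructions into distance-based form.

    program: list of (dest_register, opcode, source_registers) tuples in
    program order. Every source must be produced by an earlier instruction
    in the same list (hypothetical simplification: no live-in registers).
    Returns a list of (opcode, distances), where each distance is the
    number of instructions back to the producer of that operand.
    """
    last_writer = {}  # register name -> index of its most recent producer
    result = []
    for index, (dest, opcode, sources) in enumerate(program):
        distances = tuple(index - last_writer[src] for src in sources)
        result.append((opcode, distances))
        last_writer[dest] = index
    return result


# r1 and r2 are reused as destinations, which would normally require
# register renaming to avoid false dependencies.
example = [
    ("r1", "load", ()),
    ("r2", "load", ()),
    ("r1", "add", ("r1", "r2")),
    ("r2", "mul", ("r1", "r1")),
]
distance_form = to_distance_form(example)
```

In `distance_form`, the `add` refers to the instructions two and one positions back, and the `mul` refers twice to the instruction one position back; no register name survives in the rewritten sequence.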
Toward Practical 128-Bit General Purpose Microarchitectures
IF 2.3 Zone 3 Computer Science Pub Date: 2023-06-20 DOI: 10.1109/LCA.2023.3287762
Chandana S. Deshpande;Arthur Perais;Frédéric Pétrot
Intel introduced 5-level paging mode to support a 57-bit virtual address space in 2017. This, coupled with paradigms where backup storage can be accessed through load and store instructions (e.g., non-volatile memories), lets us envision a future in which a 64-bit address space has become insufficient. In that event, the straightforward solution would be to adopt a flat 128-bit address space. In this early-stage letter, we conduct high-level experiments that lead us to suggest a possible general-purpose processor microarchitecture providing 128-bit support with limited hardware cost.
Citations: 0
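The 57-bit figure in the entry above follows from simple arithmetic on x86-64 paging: 4 KiB pages give a 12-bit page offset, and each page-table level indexes 512 entries, which adds 9 bits. The short Python sketch below reproduces that arithmetic. The last function is only a naive back-of-the-envelope extrapolation of the same scheme to wider addresses and is an assumption of this example, not something the letter proposes; all names are hypothetical.

```python
PAGE_OFFSET_BITS = 12     # 4 KiB pages
INDEX_BITS_PER_LEVEL = 9  # 512 entries per page table


def virtual_address_bits(levels: int) -> int:
    """Virtual address width covered by a page-table walk of `levels` levels."""
    return PAGE_OFFSET_BITS + levels * INDEX_BITS_PER_LEVEL


def split_address(vaddr: int, levels: int):
    """Split an address into per-level table indices (top level first) and the page offset."""
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    indices = []
    for level in reversed(range(levels)):
        shift = PAGE_OFFSET_BITS + level * INDEX_BITS_PER_LEVEL
        indices.append((vaddr >> shift) & ((1 << INDEX_BITS_PER_LEVEL) - 1))
    return indices, offset


def levels_needed(address_bits: int) -> int:
    """Levels a naive extension of the same scheme would need (ceiling division)."""
    return -(-(address_bits - PAGE_OFFSET_BITS) // INDEX_BITS_PER_LEVEL)
```

Under these assumptions, `virtual_address_bits(4)` is 48 and `virtual_address_bits(5)` is 57, while `levels_needed(128)` is 13, which hints at why simply deepening the page-table walk scales poorly toward a flat 128-bit address space.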
DVFaaS: Leveraging DVFS for FaaS Workflows
IF 2.3 Zone 3 Computer Science Pub Date: 2023-06-20 DOI: 10.1109/LCA.2023.3288089
Achilleas Tzenetopoulos;Dimosthenis Masouros;Dimitrios Soudris;Sotirios Xydis
In this letter, we propose DVFaaS, a per-core DVFS framework that utilizes control systems theory to assign just-enough frequency for the purpose of addressing the QoS requirements on serverless workflows comprising unseen functions. DVFaaS exploits the intermittent nature of serverless workflows, which enables staged control on distinguishable functions, which jointly contribute to the end-to-end latency. Our results show that DVFaaS considerably outperforms related work, reducing power consumption by up to 22%, with 2x fewer QoS violations.
Citations: 0
Design of a High-Performance, High-Endurance Key-Value SSD for Large-Key Workloads
IF 2.3 Zone 3 Computer Science Pub Date: 2023-06-02 DOI: 10.1109/LCA.2023.3282276
Chanyoung Park;Chun-Yi Liu;Kyungtae Kang;Mahmut Kandemir;Wonil Choi
Current KV-SSD designs assume a specific range of typical workloads, where values are quite large while keys are relatively small. However, we find that (i) there exists another spectrum of workloads whose key sizes are relatively large compared to their value sizes, and (ii) the current KV-SSD design suffers from long tail latencies and low storage utilization under such large-key workloads. To this end, we present a novel KV-SSD design (called LK-SSD), which can reduce tail latencies and increase storage utilization under large-key workloads, and add an enhancement to it for longer device lifetime. Through extensive experiments, we show that LK-SSD is a more suitable design for large-key workloads and is also applicable to typical workloads.
Citations: 0