
Latest Publications in IEEE Computer Architecture Letters

Design of a High-Performance, High-Endurance Key-Value SSD for Large-Key Workloads
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-06-02 | DOI: 10.1109/LCA.2023.3282276 | Vol. 22, No. 2, pp. 149-152
Chanyoung Park;Chun-Yi Liu;Kyungtae Kang;Mahmut Kandemir;Wonil Choi
Current KV-SSD design assumes a specific range of typical workloads, where values are quite large while keys are relatively small. However, we find that (i) there exists another spectrum of workloads, whose key sizes are relatively large compared to their value sizes, and (ii) the current KV-SSD design suffers from long tail latencies and low storage utilization under such large-key workloads. To this end, we present a novel KV-SSD design (called LK-SSD) that reduces tail latencies and increases storage utilization under large-key workloads, and we add an enhancement to it for longer device lifetime. Through extensive experiments, we show that LK-SSD is a more suitable design for large-key workloads, while remaining effective for typical workloads.
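The long tail latencies the abstract refers to are conventionally quantified as high percentiles (e.g., p99) of the request-latency distribution. A minimal sketch with hypothetical latency samples (not data from the paper):

```python
import math

def percentile(samples, p):
    """p-th percentile (0-100) of latency samples, nearest-rank method."""
    xs = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(xs)))  # 1-indexed rank
    return xs[rank - 1]

# Hypothetical latencies (microseconds): mostly fast, a few slow outliers.
lat = [100] * 97 + [900, 950, 1000]
print(percentile(lat, 50))  # median -> 100
print(percentile(lat, 99))  # tail latency -> 950
```

A workload whose median looks healthy can still have a pathological tail, which is why KV-SSD evaluations report percentiles rather than averages.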
Citations: 0
Kobold: Simplified Cache Coherence for Cache-Attached Accelerators
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-04-21 | DOI: 10.1109/LCA.2023.3269399 | Vol. 22, No. 1, pp. 41-44
Jennifer Brana;Brian C. Schwedock;Yatin A. Manerkar;Nathan Beckmann
The ever-increasing cost of data movement in computer systems is driving a new era of data-centric computing. One of the most common data-centric paradigms is near-data computing (NDC), where accelerators are placed inside the memory hierarchy to avoid the costly transfer of data to the core. NDC systems show immense potential to improve performance and energy efficiency. Unfortunately, adding accelerators into the memory hierarchy incurs significant complexity for system integration because accelerators often require cache-coherent access to memory. The complex coherence protocols required to handle both cores and cache-attached accelerators result in significantly higher verification costs as well as an increase in directory state and on-chip network traffic. Furthermore, these mechanisms can cause cache pollution and worsen baseline processor performance. To simplify the integration of cache-attached accelerators, we present Kobold, a new coherence protocol and implementation which restricts the added complexity of an accelerator to its local tile. Kobold introduces a new directory structure within the L2 cache to track the accelerator's private cache and maintain coherence between the core and accelerator. A minor modification to the LLC protocol also enables accelerators to improve performance by bypassing the local L2. We verified Kobold's stable-state coherence protocols using the Murphi model checker and estimated area overhead using Cacti 7. Kobold simplifies integration of cache-attached accelerators, adds only 0.09% area over the baseline caches, and provides clear performance advantages versus naïve extensions of existing directory coherence protocols.
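The abstract's key idea is a directory inside the tile's L2 that tracks the accelerator's private cache so coherence is resolved locally. A toy sketch of that bookkeeping (class and method names are invented for illustration, not Kobold's actual design):

```python
class TileDirectory:
    """Tracks which blocks the accelerator's private cache holds, so a
    core write can invalidate the accelerator's copy within the tile
    instead of complicating the global coherence protocol."""

    def __init__(self):
        self.accel_blocks = set()  # blocks cached by the accelerator

    def accel_fill(self, block):
        self.accel_blocks.add(block)

    def core_write(self, block):
        # Before the core writes, invalidate any accelerator copy locally.
        invalidated = block in self.accel_blocks
        self.accel_blocks.discard(block)
        return invalidated

d = TileDirectory()
d.accel_fill(0x40)
print(d.core_write(0x40))  # -> True  (accelerator copy invalidated in-tile)
print(d.core_write(0x80))  # -> False (no accelerator copy to invalidate)
```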
Citations: 0
Canal: A Flexible Interconnect Generator for Coarse-Grained Reconfigurable Arrays
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-04-19 | DOI: 10.1109/LCA.2023.3268126 | Vol. 22, No. 1, pp. 45-48
Jackson Melchert;Keyi Zhang;Yuchen Mei;Mark Horowitz;Christopher Torng;Priyanka Raina
The architecture of a coarse-grained reconfigurable array (CGRA) interconnect has a significant effect on not only the flexibility of the resulting accelerator, but also its power, performance, and area. Design decisions that have complex trade-offs need to be explored to maintain efficiency and performance across a variety of evolving applications. This paper presents Canal, a Python-embedded domain-specific language (eDSL) and compiler for specifying and generating reconfigurable interconnects for CGRAs. Canal uses a graph-based intermediate representation (IR) that allows for easy hardware generation and tight integration with place and route tools. We evaluate Canal by constructing both a fully static interconnect and a hybrid interconnect with ready-valid signaling, and by conducting design space exploration of the interconnect architecture by modifying the switch box topology, the number of routing tracks, and the interconnect tile connections. Through the use of a graph-based IR for CGRA interconnects, the eDSL, and the interconnect generation system, Canal enables fast design space exploration and creation of CGRA interconnects.
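A graph-based IR like the one the abstract describes represents switch boxes, routing tracks, and tile ports as nodes with directed connection edges. A toy sketch in that spirit (node kinds and helpers are invented, not Canal's actual eDSL):

```python
class Node:
    """One element of the interconnect graph: a switch box, a routing
    track endpoint, or a tile port."""
    def __init__(self, kind, x, y, track=None):
        self.kind, self.x, self.y, self.track = kind, x, y, track
        self.edges = []  # outgoing connections

def connect(src, dst):
    src.edges.append(dst)

# Two routing tracks in one switch box, both able to drive a PE input port.
sb0 = Node("switchbox", 0, 0, track=0)
sb1 = Node("switchbox", 0, 0, track=1)
pe_in = Node("port", 0, 0)
connect(sb0, pe_in)
connect(sb1, pe_in)

# Place-and-route tools can walk this graph directly, e.g. to find drivers.
drivers = [n for n in (sb0, sb1) if pe_in in n.edges]
print(len(drivers))  # -> 2
```

Changing the number of `track` nodes or the `connect` calls is the kind of design-space knob (routing tracks, switch box topology) the abstract says Canal explores.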
Citations: 1
SmartIndex: Learning to Index Caches to Improve Performance
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-04-05 | DOI: 10.1109/LCA.2023.3264478 | Vol. 22, No. 1, pp. 33-36
Kevin Weston;Farabi Mahmud;Vahid Janfaza;Abdullah Muzahid
Modern computers rely heavily on caches to achieve higher performance. Unfortunately, a cache indexing scheme can often cause an uneven distribution of addresses across cache sets resulting in many evictions of useful cache blocks. To address this issue, we propose SmartIndex, a self-optimized indexing scheme that leverages machine learning to actively learn the memory access pattern and dynamically adjust indexes to evenly distribute the cache lines across all sets in the cache, thereby reducing cache misses. Experimental results on a set of 26 memory-intensive applications show that for non-uniform applications, SmartIndex can reduce the misses per kilo instructions (MPKI) of a direct mapped cache by up to 39%, translating into an IPC speedup of 7.23% compared to the conventional power-of-two indexing scheme. Our experiments also show that SmartIndex can work with any cache associativity.
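The "conventional power-of-two indexing scheme" the abstract compares against, and the MPKI metric it reports, can both be made concrete. A minimal sketch (block size and stride are hypothetical):

```python
def set_index_pow2(addr, num_sets, block_bytes=64):
    """Conventional power-of-two indexing: strip the block offset, then
    take the low log2(num_sets) bits as the set index."""
    assert num_sets & (num_sets - 1) == 0, "num_sets must be a power of two"
    return (addr // block_bytes) % num_sets

def mpki(misses, instructions):
    """Misses per kilo-instructions, the metric SmartIndex reduces."""
    return misses / instructions * 1000

# A strided access pattern that maps every address to the same set --
# exactly the uneven set distribution SmartIndex targets.
stride = 64 * 64  # block size * number of sets
addrs = [i * stride for i in range(8)]
print({set_index_pow2(a, 64) for a in addrs})  # -> {0}
print(mpki(39_000, 1_000_000))                 # -> 39.0
```

Because all eight addresses collide in set 0, a direct-mapped cache would evict a useful block on every access; a learned index function can spread such patterns across all sets.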
Citations: 0
An Intermediate Language for General Sparse Format Customization
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-03-28 | DOI: 10.1109/LCA.2023.3262610 | Vol. 22, No. 2, pp. 153-156
Jie Liu;Zhongyuan Zhao;Zijian Ding;Benjamin Brock;Hongbo Rong;Zhiru Zhang
The inevitable trend of hardware specialization drives an increasing use of custom data formats in processing sparse workloads, which are typically memory-bound. These formats facilitate the automated generation of target-aware data layouts to improve memory access latency and bandwidth utilization. However, existing sparse tensor programming models and compilers offer little or no support for productively customizing the sparse formats. Moreover, since these frameworks adopt an attribute-based approach for format abstraction, they cannot easily be extended to support general format customization. To overcome this deficiency, we propose UniSparse, an intermediate language that provides a unified abstraction for representing and customizing sparse formats. We also develop a compiler leveraging the MLIR infrastructure, which supports adaptive customization of formats. We demonstrate the efficacy of our approach through experiments running commonly-used sparse linear algebra operations with hybrid formats on multiple different hardware targets, including an Intel CPU, an NVIDIA GPU, and a simulated processing-in-memory (PIM) device.
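As background for the sparse formats being abstracted, here is a sketch of one of the most common ones, CSR (compressed sparse row). This is a generic illustration, not UniSparse's actual intermediate-language syntax:

```python
def dense_to_csr(mat):
    """Convert a dense row-major matrix (list of lists) to CSR:
    nonzero values, their column indices, and per-row pointers."""
    values, col_idx, row_ptr = [], [], [0]
    for row in mat:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row's nonzeros
    return values, col_idx, row_ptr

mat = [[5, 0, 0],
       [0, 0, 3],
       [2, 0, 1]]
print(dense_to_csr(mat))  # -> ([5, 3, 2, 1], [0, 2, 0, 2], [0, 1, 2, 4])
```

Formats like CSR, CSC, COO, and blocked variants differ only in how they compress and order the coordinates, which is why a unified abstraction over those choices is feasible.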
Citations: 0
XLA-NDP: Efficient Scheduling and Code Generation for Deep Learning Model Training on Near-Data Processing Memory
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-03-23 | DOI: 10.1109/LCA.2023.3261136 | Vol. 22, No. 1, pp. 61-64
Jueon Park;Hyojin Sung
Deep learning (DL) model training must address the memory bottleneck to continue scaling. Processing-in-memory approaches can be a viable solution as they move computations near or into the memory, reducing substantial data movement. However, to deploy applications on such hardware, end-to-end software support is crucial for efficient computation mapping and scheduling as well as extensible code generation, but no consideration has been made for DL training workloads. In this paper, we propose XLA-NDP, a compiler and runtime solution for NDPX, a near-data processing (NDP) architecture integrated with an existing DL training framework. XLA-NDP offloads NDPX kernels and schedules them to overlap with GPU kernels to maximize parallelism based on GPU and NDPX costs, while providing a template-based code generator with low-level optimizations. The experiments showed that XLA-NDP provides up to a 41% speedup (24% on average) over the GPU baseline for four DL model training.
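Cost-based offloading with GPU/NDP overlap, as the abstract describes, can be sketched with a simple greedy placement loop. The costs and kernel names below are hypothetical, and XLA-NDP's real scheduler additionally models data dependencies:

```python
def schedule(kernels):
    """Greedy cost-based placement: run each kernel where it is cheaper,
    tracking per-device busy time so offloaded kernels overlap with GPU
    work. Each kernel is (name, gpu_cost, ndpx_cost)."""
    busy = {"gpu": 0.0, "ndpx": 0.0}
    placement = {}
    for name, gpu_cost, ndpx_cost in kernels:
        dev = "gpu" if gpu_cost <= ndpx_cost else "ndpx"
        placement[name] = dev
        busy[dev] += gpu_cost if dev == "gpu" else ndpx_cost
    makespan = max(busy.values())  # the two devices run concurrently
    return placement, makespan

kernels = [("matmul", 4.0, 9.0), ("embed_lookup", 6.0, 2.0), ("relu", 1.0, 1.5)]
place, makespan = schedule(kernels)
print(place)     # -> {'matmul': 'gpu', 'embed_lookup': 'ndpx', 'relu': 'gpu'}
print(makespan)  # -> 5.0
```

Running everything on the GPU alone would take 4.0 + 6.0 + 1.0 = 11.0 time units; overlapping the memory-friendly kernel on the NDP side cuts the makespan to 5.0, which is the source of the speedups the abstract reports.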
Citations: 0
Towards Improved Power Management in Cloud GPUs
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-03-22 | DOI: 10.1109/LCA.2023.3278652 | Vol. 22, No. 2, pp. 141-144
Pratyush Patel;Zibo Gong;Syeda Rizvi;Esha Choukse;Pulkit Misra;Thomas Anderson;Akshitha Sriraman
As modern server GPUs are increasingly power intensive, better power management mechanisms can significantly reduce the power consumption, capital costs, and carbon emissions in large cloud datacenters. This letter uses diverse datacenter workloads to study the power management capabilities of modern GPUs. We find that current GPU management mechanisms have limited compatibility and monitoring support under cloud virtualization. They have sub-optimal, imprecise, and non-intuitive implementations of Dynamic Voltage and Frequency Scaling (DVFS) and power capping. Consequently, efficient GPU power management is not widely deployed in clouds today. To address these issues, we make actionable recommendations for GPU vendors and researchers.
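The DVFS mechanism the abstract evaluates trades frequency and voltage against dynamic power, which follows the classic CMOS model P = C V^2 f. A sketch with hypothetical operating points (not measurements from the letter):

```python
def dynamic_power(capacitance, voltage, freq_hz):
    """Classic CMOS dynamic-power model P = C * V^2 * f: the relationship
    DVFS exploits, since lowering V and f together cuts power roughly
    cubically while performance drops only roughly linearly with f."""
    return capacitance * voltage ** 2 * freq_hz

# Hypothetical GPU operating points.
nominal = dynamic_power(1e-9, 1.0, 1.5e9)  # ~1.5 W
scaled  = dynamic_power(1e-9, 0.8, 1.2e9)  # ~0.77 W
print(nominal, scaled, scaled / nominal)   # ratio ~0.512 = 0.8**3
```

This super-linear power saving is why imprecise or non-intuitive DVFS controls, as the letter finds on current GPUs, leave so much efficiency on the table.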
Citations: 4
The Jaseci Programming Paradigm and Runtime Stack: Building Scale-Out Production Applications Easy and Fast
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-03-18 | DOI: 10.1109/LCA.2023.3274038 | Vol. 22, No. 2, pp. 101-104
Jason Mars;Yiping Kang;Roland Daynauth;Baichuan Li;Ashish Mahendra;Krisztian Flautner;Lingjia Tang
Today's production scale-out applications include many sub-application components, such as storage backends, logging infrastructure and AI models. These components have drastically different characteristics, are required to work in collaboration, and interface with each other as microservices. This leads to increasingly high complexity in developing, optimizing, configuring, and deploying scale-out applications, raising the barrier to entry for most individuals and small teams. We developed a novel co-designed runtime system, Jaseci, and programming language, Jac, which together aim to reduce this complexity. The key design principle throughout Jaseci's design is to raise the level of abstraction by moving as much of the scale-out data management, microservice componentization, and live-update complexity as possible into the runtime stack, where it is automated and optimized. We use real-world AI applications to demonstrate Jaseci's benefits for application performance and developer productivity.
Citations: 0
Mitigating Timing-Based NoC Side-Channel Attacks With LLC Remapping
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-03-16 | DOI: 10.1109/LCA.2023.3276709 | Vol. 22, No. 1, pp. 53-56
Anurag Kar;Xueyang Liu;Yonghae Kim;Gururaj Saileshwar;Hyesoon Kim;Tushar Krishna
Recent CPU microarchitectural attacks utilize contention over the NoC to mount covert and side-channel attacks on multicore CPUs and leak information from victim applications. We propose NoIR, a dynamic LLC slice selection mechanism using slice remapping to obfuscate interconnect contention patterns. NoIR reduces contention variance by 92.18% and mean IPC degradation due to cache invalidation is limited to 7.38% for SPEC CPU 2017 benchmarks for a 1000-access threshold. While previous defenses focused on redesigning the NoC and routing algorithms, we show that a top-down system-level approach can significantly raise the bar for a NoC security vulnerability with minimal modifications to the NoC hardware.
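One way to picture the slice remapping the abstract describes is a keyed address-to-slice mapping that is periodically re-keyed. The keyed-hash mechanism below is an illustrative stand-in, not NoIR's actual hardware scheme:

```python
import hashlib

def slice_of(addr, num_slices, epoch_key):
    """Keyed slice selection: mixing a per-epoch key into the hash changes
    the address-to-slice mapping whenever the key rotates, breaking the
    stable contention patterns a timing attacker relies on."""
    h = hashlib.sha256(epoch_key + addr.to_bytes(8, "little")).digest()
    return h[0] % num_slices

addr = 0x7F1234567000
before = slice_of(addr, 8, b"epoch-0")
after = slice_of(addr, 8, b"epoch-1")
print(before, after)  # mapping typically differs across remap epochs
```

Within one epoch the mapping is deterministic (caches still work); across epochs an attacker's carefully built contention set stops targeting the victim's slice, at the cost of the invalidations the abstract quantifies (7.38% mean IPC degradation).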
Citations: 0
RouteReplies: Alleviating Long Latency in Many-Chip-Module GPUs
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-03-13 | DOI: 10.1109/LCA.2023.3255555 | Vol. 22, No. 1, pp. 29-32
Xia Zhao;Guangda Zhang;Lu Wang;Yangmei Li;Yongjun Zhang
GPU chip module count is expected to keep increasing to meet the strong scaling demands of parallel applications. In many-chip-module GPUs, memory access latency seriously limits performance, since the transfer latency between GPU modules is very high and cannot easily be hidden by switching between ready threads. To handle this problem, we propose RouteReplies, which enables a GPU module to fetch data from other GPU modules along the routing path. By leveraging the data locality between GPU modules, RouteReplies significantly reduces memory access latency, since a memory request no longer needs to fetch data from a faraway memory partition. For a set of applications exhibiting varying degrees of inter-module locality, RouteReplies reduces memory access latency and increases performance by 54.8% on average (up to 364.8%).
{"title":"RouteReplies: Alleviating Long Latency in Many-Chip-Module GPUs","authors":"Xia Zhao;Guangda Zhang;Lu Wang;Yangmei Li;Yongjun Zhang","doi":"10.1109/LCA.2023.3255555","DOIUrl":"10.1109/LCA.2023.3255555","url":null,"abstract":"GPU chip module count is expected to keep increasing to meet the strong scaling demands of parallel applications. In many-chip-module GPUs, memory access latency seriously limits the performance since the transferring latency between different GPU modules is very high, which cannot be easily hidden by switching between different ready threads. To handle this problem, we propose RouteReplies, which enables a GPU module to fetch data from other GPU modules in the routing path. Leveraging the data locality between different GPU modules, RouteReplies significantly reduces the memory access latency since the memory request does not need to fetch data from the faraway memory partition. For a set of applications exhibiting varying degrees of inter-module locality, RouteReplies reduces memory access latency and increases performance by 54.8% on average (up to 364.8%).","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 1","pages":"29-32"},"PeriodicalIF":2.3,"publicationDate":"2023-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45547459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
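The latency benefit the RouteReplies abstract describes can be seen in a back-of-the-envelope model: if any module along the routing path already holds the data, it replies after a short round trip; otherwise the request traverses the full path to the home memory partition. This sketch is not the paper's mechanism or simulator; the hop and memory costs are made-up illustrative numbers.

```python
def route_latency(path, holders, hop_cost=1, memory_cost=10):
    """Round-trip latency for a memory request sent along `path` (the modules
    from the requester's first hop to the home memory partition).
    With RouteReplies-style forwarding, the first module on the path that
    holds the data replies; otherwise the home module's memory is accessed.
    Costs are illustrative, not calibrated to any real GPU."""
    for hops, module in enumerate(path, start=1):
        if module in holders:
            return 2 * hops * hop_cost  # request there + reply back
    return 2 * len(path) * hop_cost + memory_cost  # full path + DRAM access


# Example: data cached at module "B", two hops away, vs. no on-path copy.
near = route_latency(["A", "B", "C"], holders={"B"})
far = route_latency(["A", "B", "C"], holders=set())
```

With these illustrative costs, the on-path hit costs 4 units versus 16 for the full fetch, mirroring the abstract's point that inter-module locality can hide much of the transfer latency.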