X-Cache: a modular architecture for domain-specific caches
A. Sedaghati, Milad Hakimi, Reza Hojabr, Arrvindh Shriraman
DOI: 10.1145/3470496.3527380
With Dennard scaling ending, architects are turning to domain-specific accelerators (DSAs). State-of-the-art DSAs work with sparse data [37] and indirectly-indexed data structures [18, 30]. They introduce non-affine and dynamic memory accesses [7, 35], and require domain-specific caches. Unfortunately, cache controllers are notorious for being difficult to architect; domain-specialization compounds the problem. DSA caches need to support custom tags, data-structure walks, multiple refills, and preloading. Prior DSAs include ad-hoc cache structures and do not implement the cache controller. We propose X-Cache, a reusable caching idiom for DSAs. We will be open-sourcing a toolchain for both generating the RTL and programming X-Cache. There are three key ideas: i) DSA-specific Tags (Meta-tag): The designer can use any combination of fields from the DSA-metadata as the tag. Meta-tags eliminate the overhead of walking and translating metadata to global addresses. This saves energy and improves load-to-use latency. ii) DSA-programmable walkers (X-Actions): We find that a common set of microcode actions can be used to implement the DSA-specific walking, data block, and tag management. We develop a programmable microcode engine that can efficiently realize the data orchestration. iii) DSA-portable controller (X-Routines): We use a portable abstraction, coroutines, to let the designer express walking and orchestration. Coroutines capture the block-level parallelism, remain lightweight, and minimize controller occupancy. We create caches for four different DSA families: Sparse GEMM [35, 37], GraphPulse [30], DASX [22], and Widx [18]. X-Cache outperforms address-based caches by 1.7× and remains competitive with hardwired DSAs (even a 50% improvement in one case). We demonstrate that meta-tags save 26--79% energy compared to address-tags. In X-Cache, meta-tags consume 1.5--6.5% of data RAM energy and the programmable microcode adds a further 7%.
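To make the X-Routines idea more concrete, here is a minimal sketch, in Python, of how a coroutine can express a data-structure walk that yields block requests keyed by meta-tags and is resumed by the controller on each refill. Every name below (spgemm_row_walker, run, the tag format) is hypothetical and illustrative only; it is not the X-Cache toolchain's API.

```python
# Minimal illustrative sketch of a coroutine-style walker in the spirit of
# X-Routines. All names are hypothetical; this is not the X-Cache API.

def spgemm_row_walker(row_ptr, col_idx, row):
    """Walk row `row` of a CSR matrix A; for each nonzero, request the matching
    row of B, keyed by a meta-tag (B's row id) rather than a global address."""
    for i in range(row_ptr[row], row_ptr[row + 1]):
        block = yield ("B_row", col_idx[i])   # request identified by meta-tag
        # ... consume `block` here (e.g., feed it to the DSA datapath)

def run(walker, backing_store):
    """Toy stand-in for the microcode engine (X-Actions): resolve each request
    and resume the coroutine with the refilled block."""
    try:
        tag = next(walker)
        while True:
            block = backing_store[tag]        # pretend cache refill
            tag = walker.send(block)
    except StopIteration:
        pass

A_row_ptr, A_col_idx = [0, 2, 5], [1, 4, 0, 2, 3]
B_rows = {("B_row", r): f"row-{r}-data" for r in range(5)}
run(spgemm_row_walker(A_row_ptr, A_col_idx, row=1), B_rows)
```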
{"title":"X-cache: a modular architecture for domain-specific caches","authors":"A. Sedaghati, Milad Hakimi, Reza Hojabr, Arrvindh Shriraman","doi":"10.1145/3470496.3527380","DOIUrl":"https://doi.org/10.1145/3470496.3527380","url":null,"abstract":"With Dennard scaling ending, architects are turning to domain-specific accelerators (DSAs). State-of-the-art DSAs work with sparse data [37] and indirectly-indexed data structures [18, 30]. They introduce non-affine and dynamic memory accesses [7, 35], and require domain-specific caches. Unfortunately, cache controllers are notorious for being difficult to architect; domain-specialization compounds the problem. DSA caches need to support custom tags, data-structure walks, multiple refills, and preloading. Prior DSAs include ad-hoc cache structures, and do not implement the cache controller. We propose X-Cache, a reusable caching idiom for DSAs. We will be open-sourcing a toolchain for both generating the RTL and programming X-Cache. There are three key ideas: i) DSA-specific Tags (Meta-tag): The designer can use any combination of fields from the DSA-metadata as the tag. Meta-tags eliminate the overhead of walking and translating metadata to global addresses. This saves energy, and improves load-to-use latency. ii) DSA-programmable walkers (X-Actions): We find that a common set of microcode actions can be used to implement the DSA-specific walking, data block, and tag management. We develop a programmable microcode engine that can efficiently realize the data orchestration. iii) DSA-portable controller (X-Routines): We use a portable abstraction, coroutines, to let the designer express walking and orchestration. Coroutines capture the block-level parallelism, remain lightweight, and minimize controller occupancy. We create caches for four different DSA families: Sparse GEMM [35, 37], GraphPulse [30], DASX [22], and Widx [18]. X-Cache outperforms address-based caches by 1.7 × and remains competitive with hardwired DSAs (even 50% improvement in one case). We demonstrate that meta-tags save 26--79% energy compared to address-tags. In X-Cache, meta-tags consume 1.5--6.5% of data RAM energy and the programmable microcode adds a further 7%.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116840662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lukewarm serverless functions: characterization and optimization
David Schall, Artemiy Margaritov, Dmitrii Ustiugov, Andreas Sandberg, Boris Grot
DOI: 10.1145/3470496.3527390
Serverless computing has emerged as a widely-used paradigm for running services in the cloud. In serverless, developers organize their applications as a set of functions, which are invoked on-demand in response to events, such as an HTTP request. To avoid long start-up delays of launching a new function instance, cloud providers tend to keep recently-triggered instances idle (or warm) for some time after the most recent invocation in anticipation of future invocations. Thus, at any given moment on a server, there may be thousands of warm instances of various functions whose executions are interleaved in time based on incoming invocations. This paper observes that (1) there is a high degree of interleaving among warm instances on a given server; (2) the individual warm functions are invoked relatively infrequently, often at the granularity of seconds or minutes; and (3) many function invocations complete within a few milliseconds. Interleaved execution of rarely invoked functions on a server leads to thrashing of each function's microarchitectural state between invocations. Meanwhile, the short execution time of a function impedes amortization of the warm-up latency of the cache hierarchy, causing a 31--114% increase in CPI compared to execution with warm microarchitectural state. We identify on-chip misses for instructions as a major contributor to the performance loss. In response we propose Jukebox, a record-and-replay instruction prefetcher specifically designed for reducing the start-up latency of warm function instances. Jukebox requires just 32KB of metadata per function instance and boosts performance by an average of 18.7% for a wide range of functions, which translates into a corresponding throughput improvement.
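As a rough illustration of the record-and-replay idea (not Jukebox's actual microarchitecture; the class and method names below are hypothetical), a prefetcher can log the instruction-block misses of one invocation and stream them back at the start of the next, bounded by a fixed per-function metadata budget:

```python
# Illustrative sketch of record-and-replay instruction prefetching.
# Hypothetical software model only; Jukebox itself is a hardware mechanism
# with roughly 32KB of per-function-instance metadata.

class RecordReplayPrefetcher:
    def __init__(self, max_records=4096):
        self.max_records = max_records      # bounds the per-function metadata
        self.trace = {}                     # function id -> recorded miss blocks

    def record_invocation(self, func_id, miss_blocks):
        """Record the instruction-block miss sequence observed during one
        invocation, deduplicated and truncated to the metadata budget."""
        self.trace[func_id] = list(dict.fromkeys(miss_blocks))[: self.max_records]

    def replay(self, func_id, issue_prefetch):
        """At the start of the next invocation of func_id, stream the recorded
        blocks into the instruction hierarchy ahead of execution."""
        for block in self.trace.get(func_id, []):
            issue_prefetch(block)

pf = RecordReplayPrefetcher()
pf.record_invocation("thumbnail-fn", miss_blocks=[0x4000, 0x4040, 0x9080, 0x4040])
pf.replay("thumbnail-fn", issue_prefetch=lambda b: print(f"prefetch I-block {hex(b)}"))
```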
{"title":"Lukewarm serverless functions: characterization and optimization","authors":"David Schall, Artemiy Margaritov, Dmitrii Ustiugov, Andreas Sandberg, Boris Grot","doi":"10.1145/3470496.3527390","DOIUrl":"https://doi.org/10.1145/3470496.3527390","url":null,"abstract":"Serverless computing has emerged as a widely-used paradigm for running services in the cloud. In serverless, developers organize their applications as a set of functions, which are invoked on-demand in response to events, such as an HTTP request. To avoid long start-up delays of launching a new function instance, cloud providers tend to keep recently-triggered instances idle (or warm) for some time after the most recent invocation in anticipation of future invocations. Thus, at any given moment on a server, there may be thousands of warm instances of various functions whose executions are interleaved in time based on incoming invocations. This paper observes that (1) there is a high degree of interleaving among warm instances on a given server; (2) the individual warm functions are invoked relatively infrequently, often at the granularity of seconds or minutes; and (3) many function invocations complete within a few milliseconds. Interleaved execution of rarely invoked functions on a server leads to thrashing of each function's microarchitectural state between invocations. Meanwhile, the short execution time of a function impedes amortization of the warm-up latency of the cache hierarchy, causing a 31--114% increase in CPI compared to execution with warm microarchitectural state. We identify on-chip misses for instructions as a major contributor to the performance loss. In response we propose Jukebox, a record-and-replay instruction prefetcher specifically designed for reducing the start-up latency of warm function instances. Jukebox requires just 32KB of metadata per function instance and boosts performance by an average of 18.7% for a wide range of functions, which translates into a corresponding throughput improvement.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131657826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hyperscale FPGA-as-a-service architecture for large-scale distributed graph neural network
Shuangchen Li, Dimin Niu, Yuhao Wang, Wei Han, Zhe Zhang, Tianchan Guan, Yijin Guan, Heng Liu, Linyong Huang, Zhaoyang Du, Fei Xue, Yuanwei Fang, Hongzhong Zheng, Yuan Xie
DOI: 10.1145/3470496.3527439
Graph neural networks (GNNs) are a promising emerging application for link prediction, recommendation, and similar tasks. Existing hardware innovation is limited to single-machine GNN (SM-GNN); enterprises, however, usually operate on huge graphs with large-scale distributed GNN (LSD-GNN), which must be carried out over distributed in-memory storage. LSD-GNN differs greatly from SM-GNN in its system architecture demands, its workflow and operators, and hence its characteristics. In this paper, we first quantitatively characterize LSD-GNN with an industrial-grade framework and application, and find that its challenges lie in graph sampling, including distributed graph access, long latency, and underutilized communication and memory bandwidth. These challenges are missing from previous SM-GNN-targeted research. We then propose a customized hardware architecture to address them, including a fully pipelined access-engine architecture for graph access and sampling, low-latency and bandwidth-efficient customized memory-over-fabric hardware, and a RISC-V-centric control system providing good programmability. We implement the proposed architecture with full software support in a 4-card FPGA heterogeneous proof-of-concept (PoC) system. Based on measurements from the FPGA PoC, we demonstrate that a single FPGA can provide the sampling capability of up to 894 vCPUs. With the goal of being profitable, programmable, and scalable, we further integrate the architecture into an FPGA cloud (FaaS) at hyperscale, along with the industrial software framework. We explicitly explore eight FaaS architectures that incorporate the proposed accelerator hardware. We conclude that off-the-shelf FaaS.base already provides a 2.47× performance-per-dollar improvement with our hardware. With architecture optimizations, FaaS.comm-opt with customized FPGA fabrics pushes the benefit to 7.78×, and FaaS.mem-opt with FPGA-local DRAM and high-speed links to the GPU further raises it to 12.58×.
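For intuition on why graph sampling dominates, the toy sketch below (hypothetical Python, not the paper's access engine or framework) shows multi-hop neighbor sampling over a partitioned graph, where every frontier vertex triggers a remote lookup; it is these fine-grained remote accesses that leave communication and memory bandwidth underutilized.

```python
# Illustrative sketch of the multi-hop neighbor sampling at the heart of
# LSD-GNN, over a graph partitioned across remote in-memory stores.
# All names and the partitioning rule are hypothetical.
import random

partitions = [  # partition id -> adjacency lists held by that remote store
    {0: [1, 2], 2: [3]},
    {1: [0, 3], 3: [2]},
]

def remote_neighbors(v):
    """Stand-in for a remote lookup; in a distributed deployment each call
    crosses the network, which is why sampling is latency-bound."""
    return partitions[v % len(partitions)].get(v, [])

def sample_khop(seed, fanouts):
    """Sample a fixed fan-out per hop (e.g., [2, 2]) starting from `seed`."""
    frontier, sampled = [seed], []
    for fanout in fanouts:
        nxt = []
        for v in frontier:
            nbrs = remote_neighbors(v)            # one remote access per vertex
            nxt += random.sample(nbrs, min(fanout, len(nbrs)))
        sampled.append(nxt)
        frontier = nxt
    return sampled

print(sample_khop(seed=0, fanouts=[2, 2]))
```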
{"title":"Hyperscale FPGA-as-a-service architecture for large-scale distributed graph neural network","authors":"Shuangchen Li, Dimin Niu, Yuhao Wang, Wei Han, Zhe Zhang, Tianchan Guan, Yijin Guan, Heng Liu, Linyong Huang, Zhaoyang Du, Fei Xue, Yuanwei Fang, Hongzhong Zheng, Yuan Xie","doi":"10.1145/3470496.3527439","DOIUrl":"https://doi.org/10.1145/3470496.3527439","url":null,"abstract":"Graph neural network (GNN) is a promising emerging application for link prediction, recommendation, etc. Existing hardware innovation is limited to single-machine GNN (SM-GNN), however, the enterprises usually adopt huge graph with large-scale distributed GNN (LSD-GNN) that has to be carried out with distributed in-memory storage. The LSD-GNN is very different from SM-GNN in terms of system architecture demand, workflow and operators, and hence characterizations. In this paper, we first quantitively characterize the LSD-GNN with industrial-grade framework and application, summarize that its challenges lie in graph sampling, including distributed graph access, long latency, and underutilized communication and memory bandwidth. These challenges are missing from previous SM-GNN targeted researches. We then propose a customized hardware architecture to solve the challenges, including a fully pipelined access engine architecture for graph access and sampling, a low-latency and bandwidth-efficient customized memory-over-fabric hardware, and a RISC-V centric control system providing good programma-bility. We implement the proposed architecture with full software support in a 4-card FPGA heterogeneous proof-of-concept (PoC) system. Based on the measurement result from the FPGA PoC, we demonstrate a single FPGA can provide up to 894 vCPU's sampling capability. With the goal of being profitable, programmable, and scalable, we further integrate the architecture to FPGA cloud (FaaS) at hyperscale, along with the industrial software framework. We explicitly explore eight FaaS architectures that carry out the proposed accelerator hardware. We finally conclude that off-the-shelf FaaS.base can already provide 2.47× performance per dollar improvement with our hardware. With architecture optimizations, FaaS.comm-opt with customized FPGA fabrics pushes the benefit to 7.78×, and FaaS.mem-opt with FPGA local DRAM and high-speed links to GPU further unleash the benefit to 12.58×.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130089660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hydra
Moinuddin K. Qureshi, Aditya Rohan, Gururaj Saileshwar, Prashant J. Nair
{"title":"Hydra","authors":"Moinuddin K. Qureshi, Aditya Rohan, Gururaj Saileshwar, Prashant J. Nair","doi":"10.1002/9781118351352.wbve1046","DOIUrl":"https://doi.org/10.1002/9781118351352.wbve1046","url":null,"abstract":"","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132048744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tiny but mighty: designing and realizing scalable latency tolerance for manycore SoCs
Marcelo Orenes-Vera, Aninda Manocha, Jonathan Balkind, Fei Gao, Juan L. Aragón, D. Wentzlaff, M. Martonosi
DOI: 10.1145/3470496.3527400
Modern computing systems employ significant heterogeneity and specialization to meet performance targets at manageable power. However, memory latency bottlenecks remain problematic, particularly for sparse neural network and graph analytic applications where indirect memory accesses (IMAs) challenge the memory hierarchy. Decades of prior art have proposed hardware and software mechanisms to mitigate IMA latency, but they fail to analyze real-chip considerations, especially when used in SoCs and manycores. In this paper, we revisit many of these techniques while taking into account manycore integration and verification. We present the first system implementation of latency tolerance hardware that provides significant speedups without requiring any memory hierarchy or processor tile modifications. This is achieved through a Memory Access Parallel-Load Engine (MAPLE), integrated through the Network-on-Chip (NoC) in a scalable manner. Our hardware-software co-design allows programs to perform long-latency memory accesses asynchronously from the core, avoiding pipeline stalls and enabling greater memory-level parallelism (MLP). In April 2021 we taped out a manycore chip that includes tens of MAPLE instances for efficient data supply. MAPLE demonstrates a full RTL implementation of out-of-core latency-mitigation hardware, with virtual memory support and automated compilation targeting it. This paper evaluates MAPLE integrated with a dual-core FPGA prototype running applications with full SMP Linux, and demonstrates geomean speedups of 2.35× and 2.27× over software-based prefetching and decoupling, respectively. Compared to state-of-the-art hardware, it provides geomean speedups of 1.82× and 1.72× over prefetching and decoupling techniques.
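The software pattern MAPLE accelerates can be sketched as decoupled access/execute: an access routine issues the indirect loads ahead of time and streams operands through a bounded queue to the compute loop. The Python below is a purely illustrative stand-in (thread plus queue) for what MAPLE does in hardware over the NoC; all names are hypothetical.

```python
# Illustrative sketch of decoupled access/execute: the access routine performs
# the indirect loads asynchronously and streams values to the compute routine
# through a bounded queue, so the compute loop never stalls on a miss.
# Hypothetical software model; MAPLE itself is an RTL engine on the NoC.
from queue import Queue
from threading import Thread

def access_engine(indices, table, q):
    """Producer: performs the indirect loads table[indices[i]] ahead of use."""
    for i in indices:
        q.put(table[i])        # in hardware: an asynchronous load issued via MAPLE
    q.put(None)                # end-of-stream marker

def compute(q, acc=0.0):
    """Consumer: dequeues already-loaded operands and does the arithmetic."""
    while (val := q.get()) is not None:
        acc += val * 2.0
    return acc

table = [float(x) for x in range(1000)]
indices = [7, 501, 3, 998, 42]              # irregular, data-dependent pattern
q = Queue(maxsize=64)                       # models the bounded hardware buffer
t = Thread(target=access_engine, args=(indices, table, q))
t.start()
print(compute(q))
t.join()
```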
{"title":"Tiny but mighty: designing and realizing scalable latency tolerance for manycore SoCs","authors":"Marcelo Orenes-Vera, Aninda Manocha, Jonathan Balkind, Fei Gao, Juan L. Aragón, D. Wentzlaff, M. Martonosi","doi":"10.1145/3470496.3527400","DOIUrl":"https://doi.org/10.1145/3470496.3527400","url":null,"abstract":"Modern computing systems employ significant heterogeneity and specialization to meet performance targets at manageable power. However, memory latency bottlenecks remain problematic, particularly for sparse neural network and graph analytic applications where indirect memory accesses (IMAs) challenge the memory hierarchy. Decades of prior art have proposed hardware and software mechanisms to mitigate IMA latency, but they fail to analyze real-chip considerations, especially when used in SoCs and manycores. In this paper, we revisit many of these techniques while taking into account manycore integration and verification. We present the first system implementation of latency tolerance hardware that provides significant speedups without requiring any memory hierarchy or processor tile modifications. This is achieved through a Memory Access Parallel-Load Engine (MAPLE), integrated through the Network-on-Chip (NoC) in a scalable manner. Our hardware-software co-design allows programs to perform long-latency memory accesses asynchronously from the core, avoiding pipeline stalls, and enabling greater memory parallelism (MLP). In April 2021 we taped out a manycore chip that includes tens of MAPLE instances for efficient data supply. MAPLE demonstrates a full RTL implementation of out-of-core latency-mitigation hardware, with virtual memory support and automated compilation targetting it. This paper evaluates MAPLE integrated with a dual-core FPGA prototype running applications with full SMP Linux, and demonstrates geomean speedups of 2.35× and 2.27× over software-based prefetching and decoupling, respectively. Compared to state-of-the-art hardware, it provides geomean speedups of 1.82× and 1.72× over prefetching and decoupling techniques.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133687902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SNS's not a synthesizer: a deep-learning-based synthesis predictor
Ceyu Xu, Chris Kjellqvist, Lisa Wu Wills
DOI: 10.1145/3470496.3527444
The number of transistors that can fit on one monolithic chip has reached billions to tens of billions in this decade thanks to Moore's Law. With the advancement of every technology generation, the transistor count per chip grows at a pace that brings about an exponential increase in design time, including the synthesis process used to perform design space exploration. Such a long delay in obtaining synthesis results hinders an efficient chip development process, significantly impacting time-to-market. In addition, these large-scale integrated circuits tend to have larger, higher-dimensional design spaces to explore, making it prohibitively expensive to obtain the physical characteristics of all possible designs using traditional synthesis tools. In this work, we propose a deep-learning-based synthesis predictor called SNS (SNS's not a Synthesizer) that predicts the area, power, and timing characteristics of a broad range of designs two to three orders of magnitude faster than the Synopsys Design Compiler, while providing a root relative squared error (RRSE) of 0.4998 on average. We further evaluate SNS via two representative case studies, a general-purpose out-of-order CPU case study using the open-source RISC-V BOOM design and an accelerator case study using an in-house Chisel implementation of DianNao, to demonstrate the capabilities and validity of SNS.
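For reference, root relative squared error normalizes the prediction error by the error of a naive predictor that always outputs the mean of the targets, so values below 1.0 beat that baseline. The small function below is an illustrative definition of the metric, not code from the SNS artifact.

```python
# Root relative squared error (RRSE): prediction error relative to the error
# of a naive predictor that always outputs the mean of the targets.
# Illustrative definition; the example numbers are made up.
import math

def rrse(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    err = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    ref = sum((t - mean) ** 2 for t in y_true)
    return math.sqrt(err / ref)

# e.g., predicted vs. synthesized area for a few hypothetical designs (mm^2)
print(rrse(y_true=[1.0, 2.0, 4.0, 8.0], y_pred=[1.1, 1.8, 4.5, 7.0]))
```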
{"title":"SNS's not a synthesizer: a deep-learning-based synthesis predictor","authors":"Ceyu Xu, Chris Kjellqvist, Lisa Wu Wills","doi":"10.1145/3470496.3527444","DOIUrl":"https://doi.org/10.1145/3470496.3527444","url":null,"abstract":"The number of transistors that can fit on one monolithic chip has reached billions to tens of billions in this decade thanks to Moore's Law. With the advancement of every technology generation, the transistor counts per chip grow at a pace that brings about exponential increase in design time, including the synthesis process used to perform design space explorations. Such a long delay in obtaining synthesis results hinders an efficient chip development process, significantly impacting time-to-market. In addition, these large-scale integrated circuits tend to have larger and higher-dimension design spaces to explore, making it prohibitively expensive to obtain physical characteristics of all possible designs using traditional synthesis tools. In this work, we propose a deep-learning-based synthesis predictor called SNS (SNS's not a Synthesizer), that predicts the area, power, and timing physical characteristics of a broad range of designs at two to three orders of magnitude faster than the Synopsys Design Compiler while providing on average a 0.4998 RRSE (root relative square error). We further evaluate SNS via two representative case studies, a general-purpose out-of-order CPU case study using RISC-V Boom open-source design and an accelerator case study using an in-house Chisel implementation of DianNao, to demonstrate the capabilities and validity of SNS.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115403581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Thermometer: profile-guided BTB replacement for data center applications
Shixin Song, Tanvir Ahmed Khan, Sara Mahdizadeh-Shahri, Akshitha Sriraman, N. Soundararajan, S. Subramoney, Daniel A. Jiménez, Heiner Litz, Baris Kasikci
DOI: 10.1145/3470496.3527430
Modern processors employ a decoupled frontend with Fetch Directed Instruction Prefetching (FDIP) to avoid frontend stalls in data center applications. However, the large branch footprint of data center applications precipitates frequent Branch Target Buffer (BTB) misses that prohibit FDIP from eliminating more than 40% of all frontend stalls. We find that the state-of-the-art BTB optimization techniques (e.g., BTB prefetching and replacement mechanisms) cannot eliminate these misses due to their inadequate understanding of branch reuse behavior in data center applications. In this paper, we first perform a comprehensive characterization of the branch behavior of data center applications, and determine that identifying optimal BTB replacement decisions requires considering both transient and holistic (i.e., across the entire execution) branch behavior. We then present Thermometer, a novel BTB replacement technique that realizes the holistic branch behavior via a profile-guided analysis. Based on the collected profile, Thermometer generates useful BTB replacement hints that the underlying hardware can leverage. We evaluate Thermometer using 13 widely-used data center applications and demonstrate that it provides an average speedup of 8.7% (0.4%-64.9%) while outperforming the state-of-the-art BTB replacement techniques by 5.6× (on average, the best performing prior work achieves 1.5% speedup). We also demonstrate that Thermometer achieves a performance speedup that is, on average, 83.6% of the speedup achieved by the optimal BTB replacement policy.
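To illustrate the flavor of profile-guided replacement hints (the structures and hint format below are hypothetical, not Thermometer's actual design), a set-associative BTB can prefer to evict entries whose branches the profile did not flag as worth keeping:

```python
# Illustrative sketch of hint-guided BTB victim selection: a profile marks
# branches whose targets are worth keeping, and the replacement policy prefers
# to evict unhinted (transient) entries. All structures are hypothetical.

WAYS = 4

class HintedBTBSet:
    def __init__(self, keep_hints):
        self.keep_hints = keep_hints          # branch PCs flagged by the profile
        self.entries = []                     # list of (pc, target), LRU at index 0

    def lookup(self, pc):
        for i, (epc, tgt) in enumerate(self.entries):
            if epc == pc:
                self.entries.append(self.entries.pop(i))   # move to MRU
                return tgt
        return None                                        # BTB miss

    def insert(self, pc, target):
        if len(self.entries) == WAYS:
            # Prefer evicting an entry the profile did not ask us to keep;
            # fall back to plain LRU if every resident entry is hinted.
            victims = [i for i, (epc, _) in enumerate(self.entries)
                       if epc not in self.keep_hints] or [0]
            self.entries.pop(victims[0])
        self.entries.append((pc, target))

btb = HintedBTBSet(keep_hints={0x400a, 0x400b})
for pc in [0x400a, 0x400b, 0x9001, 0x9002, 0x9003]:
    if btb.lookup(pc) is None:
        btb.insert(pc, pc + 0x40)
print([hex(pc) for pc, _ in btb.entries])   # hinted branches survive the churn
```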
{"title":"Thermometer: profile-guided btb replacement for data center applications","authors":"Shixin Song, Tanvir Ahmed Khan, Sara Mahdizadeh-Shahri, Akshitha Sriraman, N. Soundararajan, S. Subramoney, Daniel A. Jiménez, Heiner Litz, Baris Kasikci","doi":"10.1145/3470496.3527430","DOIUrl":"https://doi.org/10.1145/3470496.3527430","url":null,"abstract":"Modern processors employ a decoupled frontend with Fetch Directed Instruction Prefetching (FDIP) to avoid frontend stalls in data center applications. However, the large branch footprint of data center applications precipitates frequent Branch Target Buffer (BTB) misses that prohibit FDIP from eliminating more than 40% of all frontend stalls. We find that the state-of-the-art BTB optimization techniques (e.g., BTB prefetching and replacement mechanisms) cannot eliminate these misses due to their inadequate understanding of branch reuse behavior in data center applications. In this paper, we first perform a comprehensive characterization of the branch behavior of data center applications, and determine that identifying optimal BTB replacement decisions requires considering both transient and holistic (i.e., across the entire execution) branch behavior. We then present Thermometer, a novel BTB replacement technique that realizes the holistic branch behavior via a profile-guided analysis. Based on the collected profile, Thermometer generates useful BTB replacement hints that the underlying hardware can leverage. We evaluate Thermometer using 13 widely-used data center applications and demonstrate that it provides an average speedup of 8.7% (0.4%-64.9%) while outperforming the state-of-the-art BTB replacement techniques by 5.6× (on average, the best performing prior work achieves 1.5% speedup). We also demonstrate that Thermometer achieves a performance speedup that is, on average, 83.6% of the speedup achieved by the optimal BTB replacement policy.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116204913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GCoM
Jounghoo Lee, Yeonan Ha, Suhyun Lee, Jinyoung Woo, Jinho Lee, Hanhwi Jang, Youngsok Kim
DOI: 10.1145/3470496.3527384
Analytical models can help computer architects perform early-stage design space exploration orders of magnitude faster than cycle-level simulators. To facilitate rapid design space exploration for graphics processing units (GPUs), prior studies have proposed GPU analytical models which capture first-order stall events causing performance degradation; however, the existing analytical models cannot accurately model modern GPUs due to their outdated and highly abstract GPU core microarchitecture assumptions. Therefore, to accurately evaluate the performance of modern GPUs, we need a new GPU analytical model which accurately captures the stall events incurred by the significant changes in the core microarchitectures of modern GPUs. We propose GCoM, an accurate GPU analytical model which faithfully captures the key core-side stall events of modern GPUs. Through detailed microarchitecture-driven GPU core modeling, GCoM accurately models modern GPUs by revealing the following key core-side stalls overlooked by the existing GPU analytical models. First, GCoM identifies the compute structural stall events caused by the limited per-sub-core functional units. Second, GCoM exposes the memory structural stalls due to the limited banks and shared nature of per-core L1 data caches. Third, GCoM correctly predicts the memory data stalls induced by the sectored L1 data caches, which split a cache line into a set of sectors sharing the same tag. Fourth, GCoM captures the idle stalls incurred by inter- and intra-core load imbalances. Our experiments using an NVIDIA RTX 2060 configuration show that GCoM greatly improves modeling accuracy, achieving a mean absolute error of 10.0% against the Accel-Sim cycle-level simulator, whereas the state-of-the-art GPU analytical model achieves a mean absolute error of 44.9%.
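As a loose illustration of such a stall-decomposed analytical model (hypothetical numbers and equations, not GCoM's actual formulation), per-kernel cycles can be estimated as base compute time plus the four core-side stall components listed above, and accuracy reported as mean absolute error against a cycle-level simulator:

```python
# Illustrative stall-decomposed cycle estimate and mean-absolute-error check.
# The breakdown and the numbers are hypothetical example data.

def estimate_cycles(k):
    return (k["compute"]
            + k["compute_structural_stall"]   # limited per-sub-core functional units
            + k["memory_structural_stall"]    # L1D bank conflicts / sharing
            + k["memory_data_stall"]          # sectored-L1D miss latency
            + k["idle_stall"])                # inter-/intra-core load imbalance

def mean_absolute_error(kernels):
    errs = [abs(estimate_cycles(k) - k["sim_cycles"]) / k["sim_cycles"]
            for k in kernels]
    return 100.0 * sum(errs) / len(errs)

kernels = [  # made-up per-kernel breakdowns purely to exercise the calculation
    dict(compute=9000, compute_structural_stall=700, memory_structural_stall=500,
         memory_data_stall=1800, idle_stall=300, sim_cycles=12700),
    dict(compute=4000, compute_structural_stall=300, memory_structural_stall=900,
         memory_data_stall=2500, idle_stall=100, sim_cycles=8200),
]
print(f"MAE vs. simulator: {mean_absolute_error(kernels):.1f}%")
```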
{"title":"GCoM","authors":"Jounghoo Lee, Yeonan Ha, Suhyun Lee, Jinyoung Woo, Jinho Lee, Hanhwi Jang, Youngsok Kim","doi":"10.1145/3470496.3527384","DOIUrl":"https://doi.org/10.1145/3470496.3527384","url":null,"abstract":"Analytical models can greatly help computer architects perform orders of magnitude faster early-stage design space exploration than using cycle-level simulators. To facilitate rapid design space exploration for graphics processing units (GPUs), prior studies have proposed GPU analytical models which capture first-order stall events causing performance degradation; however, the existing analytical models cannot accurately model modern GPUs due to their outdated and highly abstract GPU core microarchitecture assumptions. Therefore, to accurately evaluate the performance of modern GPUs, we need a new GPU analytical model which accurately captures the stall events incurred by the significant changes in the core microarchitectures of modern GPUs. We propose GCoM, an accurate GPU analytical model which faithfully captures the key core-side stall events of modern GPUs. Through detailed microarchitecture-driven GPU core modeling, GCoM accurately models modern GPUs by revealing the following key core-side stalls overlooked by the existing GPU analytical models. First, GCoM identifies the compute structural stall events caused by the limited per-sub-core functional units. Second, GCoM exposes the memory structural stalls due to the limited banks and shared nature of per-core L1 data caches. Third, GCoM correctly predicts the memory data stalls induced by the sectored L1 data caches which split a cache line into a set of sectors sharing the same tag. Fourth, GCoM captures the idle stalls incurred by the inter- and intra-core load imbalances. Our experiments using an NVIDIA RTX 2060 configuration show that GCoM greatly improves the modeling accuracy by achieving a mean absolute error of 10.0% against Accel-Sim cycle-level simulator, whereas the state-of-the-art GPU analytical model achieves a mean absolute error of 44.9%.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125921708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Increasing Ising machine capacity with multi-chip architectures
Anshujit Sharma, R. Afoakwa, Z. Ignjatovic, Michael C. Huang
DOI: 10.1145/3470496.3527414
Nature has inspired many problem-solving techniques over the decades. More recently, researchers have increasingly turned to harnessing nature to solve problems directly. Ising machines are a good example, with numerous research prototypes as well as many design concepts. They can map a family of NP-complete problems and derive competitive solutions at speeds much greater than conventional algorithms and, in some cases, at a fraction of the energy cost of a von Neumann computer. However, physical Ising machines are often fixed in their problem-solving capacity. Without any support, a bigger problem cannot be solved at all, and with a simple divide-and-conquer strategy, it turns out, the advantage of using an Ising machine quickly diminishes. It is therefore desirable for Ising machines to have a scalable architecture where multiple instances can collaborate to solve a bigger problem. We discuss scalable architecture design issues, which lead to a multiprocessor Ising machine architecture. Experimental analyses show that our proposed architectures allow an Ising machine to scale in capacity and maintain its significant performance advantage (about 2200× speedup over a state-of-the-art computational substrate). For communication bandwidth-limited systems, our proposed optimizations in supporting batch-mode operation can cut communication demand by about 4--5× without a significant impact on solution quality.
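For background, an Ising machine minimizes the energy E(s) = -Σ_{i<j} J_ij s_i s_j - Σ_i h_i s_i over spins s_i ∈ {-1, +1}. The toy sketch below evaluates this energy by brute force and shows why naive divide-and-conquer across chips is costly: every coupling that crosses a chip boundary becomes inter-chip communication. The problem instance and the partitioning are made up for illustration.

```python
# Illustrative Ising energy evaluation and a toy two-chip partition.
# Hypothetical toy problem, not the paper's multi-chip design.
import itertools

J = {(0, 1): 1.0, (1, 2): -0.5, (2, 3): 1.0, (0, 3): 0.7}   # couplings J_ij
h = [0.1, 0.0, -0.2, 0.0]                                    # local fields h_i

def energy(spins):
    coupling = sum(Jij * spins[i] * spins[j] for (i, j), Jij in J.items())
    field = sum(hi * si for hi, si in zip(h, spins))
    return -coupling - field

# Brute-force ground state of the toy problem (feasible only at toy scale;
# a real Ising machine anneals toward it physically).
best = min(itertools.product([-1, 1], repeat=len(h)), key=energy)
print("ground state", best, "energy", energy(best))

# Partition spins across two chips and list the couplings that cross the cut:
# each one requires inter-chip communication in a multi-chip architecture.
chip_of = {0: 0, 1: 0, 2: 1, 3: 1}
cut = [(i, j) for (i, j) in J if chip_of[i] != chip_of[j]]
print("couplings needing inter-chip communication:", cut)
```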
{"title":"Increasing ising machine capacity with multi-chip architectures","authors":"Anshujit Sharma, R. Afoakwa, Z. Ignjatovic, Michael C. Huang","doi":"10.1145/3470496.3527414","DOIUrl":"https://doi.org/10.1145/3470496.3527414","url":null,"abstract":"Nature has inspired a lot of problem solving techniques over the decades. More recently, researchers have increasingly turned to harnessing nature to solve problems directly. Ising machines are a good example and there are numerous research prototypes as well as many design concepts. They can map a family of NP-complete problems and derive competitive solutions at speeds much greater than conventional algorithms and in some cases, at a fraction of the energy cost of a von Neumann computer. However, physical Ising machines are often fixed in its problem solving capacity. Without any support, a bigger problem cannot be solved at all. With a simple divide-and-conquer strategy, it turns out, the advantage of using an Ising machine quickly diminishes. It is therefore desirable for Ising machines to have a scalable architecture where multiple instances can collaborate to solve a bigger problem. We then discuss scalable architecture design issues which lead to a multiprocessor Ising machine architecture. Experimental analyses show that our proposed architectures allow an Ising machine to scale in capacity and maintain its significant performance advantage (about 2200x speedup over a state-of-the-art computational substrate). In the case of communication bandwidth-limited systems, our proposed optimizations in supporting batch mode operation can cut down communication demand by about 4--5x without a significant impact on solution quality.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"61 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123300135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gearbox
Marzieh Lenjani, A. Ahmed, M. Stan, K. Skadron
DOI: 10.1145/3470496.3527402
Processing-in-memory (PIM) minimizes data movement overheads by placing processing units near each memory segment. Recent PIMs employ processing units with a SIMD architecture. However, kernels with random accesses, such as sparse-matrix-dense-vector (SpMV) and sparse-matrix-sparse-vector (SpMSpV) multiplication, cannot effectively exploit the parallelism of SIMD units because the SIMD ALUs remain idle until all the operands are collected from the local memory segment (the segment attached to the processing unit) or remote memory segments (other segments of the memory). For SpMV and SpMSpV, properly partitioning the matrix and the vector among the memory segments is also very important. Partitioning determines (i) how much processing load will be assigned to each processing unit and (ii) how much communication is required among the processing units. In this paper, we first propose a highly parallel architecture that can exploit the available parallelism even in the presence of random accesses. Second, we observe that, in SpMV and SpMSpV, most of the remote accesses become remote accumulations with the right choice of algorithm and partitioning. These remote accumulations can be offloaded to the processing units next to the destination memory segments, eliminating idle time due to remote accesses. Accordingly, we introduce a dispatching mechanism for remote accumulation offloading. Third, we propose Hybrid partitioning and associated hardware support. Our partitioning technique enables (i) replacing remote read accesses with broadcasting (for only a small portion of data that will be read by all processing units), (ii) reducing the number of remote accumulations, and (iii) balancing the load. Our proposed method, Gearbox, with just one memory stack, delivers an average (up to) speedup of 15.73× (52×) over a server-class GPU, an NVIDIA P100 with three stacks of HBM2 memory.
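A toy sketch of the remote-accumulation idea (hypothetical Python model, not Gearbox's hardware or dispatching mechanism): in partitioned SpMV, each segment multiplies the nonzeros it stores, and any product destined for another segment's slice of the output vector is shipped there and accumulated at the destination, rather than performing a remote read.

```python
# Illustrative software model of partitioned SpMV with remote accumulation.
# All names and the block partitioning rule are hypothetical.

NUM_SEGMENTS = 2

def owner(row, n_rows):
    """Which memory segment owns output element y[row] (simple block split)."""
    return min(row * NUM_SEGMENTS // n_rows, NUM_SEGMENTS - 1)

def spmv_with_remote_accumulation(coo, x, n_rows):
    y_local = [dict() for _ in range(NUM_SEGMENTS)]   # each segment's output slice
    outbox = [[] for _ in range(NUM_SEGMENTS)]        # products shipped between segments

    # Phase 1: every segment multiplies the nonzeros it stores.
    for seg, nonzeros in enumerate(coo):
        for (r, c, val) in nonzeros:
            product = val * x[c]
            dst = owner(r, n_rows)
            if dst == seg:
                y_local[seg][r] = y_local[seg].get(r, 0.0) + product
            else:
                outbox[dst].append((r, product))      # remote accumulation message

    # Phase 2: destination segments apply the accumulations they received.
    for dst, msgs in enumerate(outbox):
        for r, product in msgs:
            y_local[dst][r] = y_local[dst].get(r, 0.0) + product

    y = [0.0] * n_rows
    for seg in range(NUM_SEGMENTS):
        for r, v in y_local[seg].items():
            y[r] = v
    return y

coo = [[(0, 0, 2.0), (3, 1, 1.0)],      # nonzeros (row, col, value) in segment 0
       [(1, 2, 4.0), (0, 3, 3.0)]]      # nonzeros stored in segment 1
print(spmv_with_remote_accumulation(coo, x=[1.0, 1.0, 1.0, 1.0], n_rows=4))
```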
{"title":"Gearbox","authors":"Marzieh Lenjani, A. Ahmed, M. Stan, K. Skadron","doi":"10.1145/3470496.3527402","DOIUrl":"https://doi.org/10.1145/3470496.3527402","url":null,"abstract":"Processing-in-memory (PIM) minimizes data movement overheads by placing processing units near each memory segment. Recent PIMs employ processing units with a SIMD architecture. However, kernels with random accesses, such as sparse-matrix-dense-vector (SpMV) and sparse-matrix-sparse-vector (SpMSpV), cannot effectively exploit the parallelism of SIMD units because SIMD's ALUs remain idle until all the operands are collected from local memory segments (memory segment attached to the processing unit) or remote memory segments (other segments of the memory). For SpMV and SpMSpV, properly partitioning the matrix and the vector among the memory segments is also very important. Partitioning determines (i) how much processing load will be assigned to each processing unit and (ii) how much communication is required among the processing units. In this paper, first, we propose a highly parallel architecture that can exploit the available parallelism even in the presence of random accesses. Second, we observed that, in SpMV and SpMSpV, most of the remote accesses become remote accumulations with the right choice of algorithm and partitioning. The remote accumulations could be offloaded to be performed by processing units next to the destination memory segments, eliminating idle time due to remote accesses. Accordingly, we introduce a dispatching mechanism for remote accumulation offloading. Third, we propose Hybrid partitioning and associated hardware support. Our partitioning technique enables (i) replacing remote read accesses with broadcasting (for only a small portion of data that will be read by all processing units), (ii) reducing the number of remote accumulations, and (iii) balancing the load. Our proposed method, Gearbox, with just one memory stack, delivers on average (up to) 15.73X (52X) speedup over a server-class GPU, NVIDIA P100, with three stacks of HBM2 memory.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115871842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}