Compared to widely used superconducting qubits, neutral-atom quantum computing technology promises potentially better scalability, a flexible arrangement of qubits that allows higher operation parallelism, and more relaxed cooling requirements. The high performance computing (HPC) and architecture community is beginning to design new solutions that take advantage of neutral-atom quantum architectures and overcome their unique challenges. We propose Geyser, the first work to take advantage of the multi-qubit gates natively supported by neutral-atom quantum computers, by appropriately mapping quantum circuits to a three-qubit-friendly physical arrangement of qubits. Geyser then creates multiple logical blocks in the quantum circuit to exploit quantum parallelism and reduce the number of pulses needed to realize physical gates. These circuit blocks elegantly enable Geyser to compose equivalent circuits with three-qubit gates, even when the original program does not have any multi-qubit gates. Our evaluation shows that Geyser reduces the number of operation pulses by 25%-90% and improves the algorithm's output fidelity by 25-60 percentage points across different algorithms.
{"title":"Geyser","authors":"Tirthak Patel, Daniel Silver, Devesh Tiwari","doi":"10.1145/3470496.3527428","DOIUrl":"https://doi.org/10.1145/3470496.3527428","url":null,"abstract":"Compared to widely-used superconducting qubits, neutral-atom quantum computing technology promises potentially better scalability and flexible arrangement of qubits to allow higher operation parallelism and more relaxed cooling requirements. The high performance computing (HPC) and architecture community is beginning to design new solutions to take advantage of neutral-atom quantum architectures and overcome its unique challenges. We propose Geyser, the first work to take advantage of the multi-qubit gates natively supported by neutral-atom quantum computers by appropriately mapping quantum circuits to three-qubit-friendly physical arrangement of qubits. Then, Geyser creates multiple logical blocks in the quantum circuit to exploit quantum parallelism and reduce the number of pulses needed to realize physical gates. These circuit blocks elegantly enable Geyser to compose equivalent circuits with three-qubit gates, even when the original program does not have any multi-qubit gates. Our evaluation results show Geyser reduces the number of operation pulses by 25%-90% and improves the algorithm's output fidelity by 25%-60% points across different algorithms.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"15 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123514219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The performance and efficiency of running modern Deep Neural Networks (DNNs) are heavily bounded by data movement. To mitigate data-movement bottlenecks, recent DNN inference accelerator designs widely adopt aggressive compression techniques and sparse-skipping mechanisms, which avoid transferring or computing on zero-valued weights or activations to save time and energy. However, such sparse-skipping logic requires large input buffers and incurs irregular data access patterns, precluding many energy-efficient data reuse opportunities and dataflows. In this work, we propose Cascading Structured Pruning (CSP), a technique that preserves significantly more data reuse opportunities for higher energy efficiency while maintaining performance comparable to recent sparse architectures such as SparTen. CSP has two components. At the algorithm level, CSP-A induces a predictable sparsity pattern that allows low-overhead compression of weight data and sequential access to both activation and weight data. At the architecture level, CSP-H leverages CSP-A's induced sparsity pattern with a novel dataflow that accesses unique activation data only once, removing the need for large input buffers. Each CSP-H processing element (PE) employs a novel accumulation-buffer design and a counter-based sparse-skipping mechanism to support the dataflow with minimal controller overhead. We verify our approach on several representative models. Our simulation results show that CSP achieves on average a 15× energy-efficiency improvement over SparTen with comparable or superior speedup under most evaluations.
{"title":"Cascading structured pruning: enabling high data reuse for sparse DNN accelerators","authors":"Edward Hanson, Shiyu Li, H. Li, Yiran Chen","doi":"10.1145/3470496.3527419","DOIUrl":"https://doi.org/10.1145/3470496.3527419","url":null,"abstract":"Performance and efficiency of running modern Deep Neural Networks (DNNs) are heavily bounded by data movement. To mitigate the data movement bottlenecks, recent DNN inference accelerator designs widely adopt aggressive compression techniques and sparse-skipping mechanisms. These mechanisms avoid transferring or computing with zero-valued weights or activations to save time and energy. However, such sparse-skipping logic involves large input buffers and irregular data access patterns, thus precluding many energy-efficient data reuse opportunities and dataflows. In this work, we propose Cascading Structured Pruning (CSP), a technique that preserves significantly more data reuse opportunities for higher energy efficiency while maintaining comparable performance relative to recent sparse architectures such as SparTen. CSP includes the following two components: At algorithm level, CSP-A induces a predictable sparsity pattern that allows for low-overhead compression of weight data and sequential access to both activation and weight data. At architecture level, CSP-H leverages CSP-A's induced sparsity pattern with a novel dataflow to access unique activation data only once, thus removing the demand for large input buffers. Each CSP-H processing element (PE) employs a novel accumulation buffer design and a counter-based sparse-skipping mechanism to support the dataflow with minimum controller overhead. We verify our approach on several representative models. Our simulated results show that CSP achieves on average 15× energy efficiency improvement over SparTen with comparable or superior speedup under most evaluations.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130722359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jonathan Lew, Y. Liu, Wenyi Gong, Negar Goli, R. D. Evans, Tor M. Aamodt
Deep Neural Networks (DNNs) are the state of the art in image, speech, and text processing. To address long training times and high energy consumption, custom accelerators can exploit sparsity, that is, zero-valued weights, activations, and gradients. Previously proposed sparse Convolutional Neural Network (CNN) accelerators support training with no more than one dynamically sparse convolution input. Among existing accelerator classes, the only ones supporting two-sided dynamic sparsity are outer-product-based accelerators. However, when a convolution is mapped onto an outer product, multiplications occur that do not correspond to any valid output. These Redundant Cartesian Products (RCPs) decrease energy efficiency and performance. We observe that in sparse training, up to 90% of computations are RCPs, resulting from the convolution of large matrices for weight updates during the backward pass of CNN training. In this work, we design a mechanism, ANT, to anticipate and eliminate RCPs, enabling more efficient sparse training when integrated with an outer-product accelerator. By anticipating over 90% of RCPs, ANT achieves a geometric-mean speedup of 3.71× over an SCNN-like accelerator [67] on 90% sparse training using DenseNet-121 [38], ResNet18 [35], VGG16 [73], Wide ResNet (WRN) [85], and ResNet-50 [35], with a 4.40× decrease in energy consumption and 0.0017 mm² of additional area. We extend ANT to sparse matrix multiplication, so that the same accelerator can anticipate RCPs in sparse fully-connected layers, transformers, and RNNs.
{"title":"Anticipating and eliminating redundant computations in accelerated sparse training","authors":"Jonathan Lew, Y. Liu, Wenyi Gong, Negar Goli, R. D. Evans, Tor M. Aamodt","doi":"10.1145/3470496.3527404","DOIUrl":"https://doi.org/10.1145/3470496.3527404","url":null,"abstract":"Deep Neural Networks (DNNs) are the state of art in image, speech, and text processing. To address long training times and high energy consumption, custom accelerators can exploit sparsity, that is zero-valued weights, activations, and gradients. Proposed sparse Convolution Neural Network (CNN) accelerators support training with no more than one dynamic sparse convolution input. Among existing accelerator classes, the only ones supporting two-sided dynamic sparsity are outer-product-based accelerators. However, when mapping a convolution onto an outer product, multiplications occur that do not correspond to any valid output. These Redundant Cartesian Products (RCPs) decrease energy efficiency and performance. We observe that in sparse training, up to 90% of computations are RCPs resulting from the convolution of large matrices for weight updates during the backward pass of CNN training. In this work, we design a mechanism, ANT, to anticipate and eliminate RCPs, enabling more efficient sparse training when integrated with an outer-product accelerator. By anticipating over 90% of RCPs, ANT achieves a geometric mean of 3.71× speed up over an SCNN-like accelerator [67] on 90% sparse training using DenseNet-121 [38], ResNet18 [35], VGG16 [73], Wide ResNet (WRN) [85], and ResNet-50 [35], with 4.40× decrease in energy consumption and 0.0017mm2 of additional area. We extend ANT to sparse matrix multiplication, so that the same accelerator can anticipate RCPs in sparse fully-connected layers, transformers, and RNNs.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131732243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jian Chen, Xiaoyu Zhang, Tao Wang, Ying Zhang, Tao Chen, Jiajun Chen, Mingxu Xie, Qiang Liu
Network intrusion detection systems (IDS) are crucial for secure cloud computing, but they are also severely constrained by CPU computation capacity as network bandwidth increases. Therefore, hardware offloading is essential for IDS servers to support the ever-growing throughput demand of packet processing. Based on experience with large-scale IDS deployment, we find that existing hardware offloading solutions have fundamental limitations that prevent them from being massively deployed in production environments. In this paper, we present Fidas, an FPGA-based intrusion detection offload system that avoids the limitations of existing hardware solutions by comprehensively offloading the primary NIC, rule pattern matching, and traffic flow rate classification. The pattern matching module in Fidas uses a multi-level filter-based approach for efficient regex processing, and the flow rate classification module employs a novel dual-stack memory scheme to identify hot flows under volumetric attacks. Our evaluation shows that Fidas achieves state-of-the-art throughput in pattern matching and flow rate classification while freeing up processors for other security-related functionality. Fidas is deployed in the production data center and has been battle-tested for its performance, cost-effectiveness, and DevOps agility.
{"title":"Fidas","authors":"Jian Chen, Xiaoyu Zhang, Tao Wang, Ying Zhang, Tao Chen, Jiajun Chen, Mingxu Xie, Qiang Liu","doi":"10.1145/3470496.3533043","DOIUrl":"https://doi.org/10.1145/3470496.3533043","url":null,"abstract":"Network intrusion detection systems (IDS) are crucial for secure cloud computing, but they are also severely constrained by CPU computation capacity as the network bandwidth increases. Therefore, hardware offloading is essential for the IDS servers to support the ever-growing throughput demand for packet processing. Based on the experience of large-scale IDS deployment, we find the existing hardware offloading solutions have fundamental limitations that prevent them from being massively deployed in the production environment. In this paper, we present Fidas, an FPGA-based intrusion detection offload system that avoids the limitations of the existing hardware solutions by comprehensively offloading the primary NIC, rule pattern matching, and traffic flow rate classification. The pattern matching module in Fidas uses a multi-level filter-based approach for efficient regex processing, and the flow rate classification module employs a novel dual-stack memory scheme to identify the hot flows under volumetric attacks. Our evaluation shows that Fidas achieves the state-of-the-art throughput in pattern matching and flow rate classification while freeing up processors for other security-related functionalities. Fidas is deployed in the production data center and has been battle-tested for its performance, cost-effectiveness, and DevOps agility.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128741843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivasa R. Sridharan, T. Krishna
Distributed training reduces DNN training time by splitting the task across multiple NPUs (e.g., GPUs/TPUs). However, distributed training adds communication overhead between the NPUs in order to synchronize gradients and/or activations, depending on the parallelization strategy. In next-generation platforms for training at scale, NPUs will be connected through multidimensional networks with diverse, heterogeneous bandwidths. This work identifies a looming challenge: if we simply apply today's collective-communication scheduling techniques to such hybrid environments, it is hard to keep all network dimensions busy and maximize network bandwidth (BW). We propose Themis, a novel collective scheduling scheme that dynamically schedules collectives (divided into chunks) to balance the communication load across all dimensions, further improving network BW utilization. Our results show that, on average, Themis improves the network BW utilization of a single All-Reduce by 1.72× (2.70× max), and improves the end-to-end training iteration performance of real workloads such as ResNet-152, GNMT, DLRM, and Transformer-1T by 1.49× (2.25× max), 1.30× (1.78× max), 1.30× (1.77× max), and 1.25× (1.53× max), respectively.
{"title":"Themis","authors":"Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivasa R. Sridharan, T. Krishna","doi":"10.1145/3470496.3527382","DOIUrl":"https://doi.org/10.1145/3470496.3527382","url":null,"abstract":"Distributed training is a solution to reduce DNN training time by splitting the task across multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead between the NPUs in order to synchronize the gradients and/or activation, depending on the parallelization strategy. In next-generation platforms for training at scale, NPUs will be connected through multidimensional networks with diverse, heterogeneous bandwidths. This work identifies a looming challenge of keeping all network dimensions busy and maximizing the network BW within the hybrid environment if we leverage scheduling techniques for collective communication on systems today. We propose Themis, a novel collective scheduling scheme that dynamically schedules collectives (divided into chunks) to balance the communication loads across all dimensions, further improving the network BW utilization. Our results show that on average, Themis can improve the network BW utilization of the single All-Reduce by 1.72× (2.70× max), and improve the end-to-end training iteration performance of real workloads such as ResNet-152, GNMT, DLRM, and Transformer-1T by 1.49× (2.25× max), 1.30× (1.78× max), 1.30× (1.77× max), and 1.25× (1.53× max), respectively.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114761716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joseph Ravichandran, Weon Taek Na, Jay Lang, Mengjia Yan
The clock grid is a mainstream clock-network methodology for high-performance microprocessor and SoC designs. Clock skew, power usage, and robustness to PVT (process, voltage, temperature) variation are all important metrics for a high-quality clock grid design. The tree-driven grid is a typical clock-grid network: it includes a clock source, a buffered tree, leaf buffers, a mesh clock grid, local clock buffers, and latches, as shown in Fig. 1. For such a network, one big challenge is how to connect the leaf-level buffers of the global tree to a grid with nonuniform loads under tight slew and skew constraints. The choice of tapping points that connect the leaf buffers to the clock grid is critical to the quality of the clock design; good tapping points can minimize clock skew and reduce power. In this paper, we propose a new algorithm for selecting tapping points that builds the global tree as regularly and symmetrically as possible. Our experimental results show that the proposed algorithm efficiently reduces global clock skew, rising slew, maximum overshoot, and power, while avoiding local skew violations.
{"title":"PACMAN","authors":"Joseph Ravichandran, Weon Taek Na, Jay Lang, Mengjia Yan","doi":"10.1145/2633948.2633951","DOIUrl":"https://doi.org/10.1145/2633948.2633951","url":null,"abstract":"Clock grid is a mainstream clock network methodology for high performance microprocessor and SOC designs. Clock skew, power usage and robustness to PVT (power, voltage, temperature) are all important metrics for a high quality clock grid design. Tree-driven-grid clock network is a typical clock grid clock network. It includes a clock source, a buffered tree, leaf buffers, a mesh clock grid, local clock buffers, and latches as shown in Fig. 1. For such network, one big challenge is how to connect the leaf level buffers of the global tree to the grid with nonuniform loads under tight slew and skew constraints. The choice of tapping points that connect the leaf buffers to the clock grid are critical to the quality of the clock designs. Good tapping points can minimize the clock skew and reduce power. In this paper, we proposed a new algorithm to select the tapping points to build the global tree as regular and symmetric as possible. From our experimental results, the proposed algorithm can efficiently reduce global clock skew, rising slew, maximum overshoot, reduce power, and avoid local skew violation.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114768999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jin Zhao, Yun Yang, Yu Zhang, Xiaofei Liao, Lin Gu, Ligang He, Bin He, Hai Jin, Haikun Liu, Xinyu Jiang, Hui Yu
Many solutions have recently been proposed to support the processing of streaming graphs. However, when each graph snapshot of a streaming graph is processed, the new states of the vertices affected by the graph updates are propagated irregularly along the graph topology. Despite years of research effort, existing approaches still suffer from serious redundant computation overhead and irregular memory accesses, which severely underutilize a many-core processor. To address these issues, this paper proposes TDGraph, a topology-driven programmable accelerator that is the first to augment many-core processors for high-performance streaming-graph processing. Specifically, we incorporate an efficient topology-driven incremental execution approach into the accelerator design for more regular state propagation and better data locality. TDGraph takes the vertices affected by graph updates as roots, prefetches other vertices along the graph topology, and synchronizes their incremental computations on the fly. In this way, most state propagations originating from vertices affected by different graph updates can be conducted together along the graph topology, which helps reduce redundant computation and data-access cost. Besides, through efficient coalescing of accesses to vertex states, TDGraph further improves the utilization of cache and memory bandwidth. We have evaluated TDGraph on a simulated 64-core processor. The results show that a state-of-the-art software system achieves speedups of 7.1×-21.4× after integration with TDGraph, while incurring only 0.73% area cost. Compared with four cutting-edge accelerators, i.e., HATS, Minnow, PHI, and DepGraph, TDGraph achieves speedups of 4.6×-12.7×, 3.2×-8.6×, 3.8×-9.7×, and 2.3×-6.1×, respectively.
{"title":"TDGraph","authors":"Jin Zhao, Yun Yang, Yu Zhang, Xiaofei Liao, Lin Gu, Ligang He, Bin He, Hai Jin, Haikun Liu, Xinyu Jiang, Hui Yu","doi":"10.1145/3470496.3527409","DOIUrl":"https://doi.org/10.1145/3470496.3527409","url":null,"abstract":"Many solutions have been recently proposed to support the processing of streaming graphs. However, for the processing of each graph snapshot of a streaming graph, the new states of the vertices affected by the graph updates are propagated irregularly along the graph topology. Despite the years' research efforts, existing approaches still suffer from the serious problems of redundant computation overhead and irregular memory access, which severely underutilizes a many-core processor. To address these issues, this paper proposes a topology-driven programmable accelerator TDGraph, which is the first accelerator to augment the many-core processors to achieve high performance processing of streaming graphs. Specifically, we propose an efficient topology-driven incremental execution approach into the accelerator design for more regular state propagation and better data locality. TDGraph takes the vertices affected by graph updates as the roots to prefetch other vertices along the graph topology and synchronizes the incremental computations of them on the fly. In this way, most state propagations originated from multiple vertices affected by different graph updates can be conducted together along the graph topology, which help reduce the redundant computations and data access cost. Besides, through the efficient coalescing of the accesses to vertex states, TDGraph further improves the utilization of the cache and memory bandwidth. We have evaluated TDGraph on a simulated 64-core processor. The results show that, the state-of-the-art software system achieves the speedup of 7.1~21.4 times after integrating with TDGraph, while incurring only 0.73% area cost. Compared with four cutting-edge accelerators, i.e., HATS, Minnow, PHI, and DepGraph, TDGraph gains the speedups of 4.6~12.7, 3.2~8.6, 3.8~9.7, and 2.3~6.1 times, respectively.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"506 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120882129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C. Lichtenau, A. Buyuktosunoglu, Ramon Bertran Monfort, P. Figuli, C. Jacobi, N. Papandreou, H. Pozidis, A. Saporito, Andrew Sica, Elpida Tzortzatos
IBM Telum is the next-generation processor chip for IBM Z and LinuxONE systems. The Telum design is focused on enterprise-class workloads and achieves over 40% per-socket performance growth compared to IBM z15. IBM Telum is the first server-class chip with a dedicated on-chip AI accelerator, which enables clients to gain real-time insights from their data as it is being processed. Seamlessly infusing AI into all enterprise workloads is highly desirable, both to get real business insight on every transaction and to improve IT operations, security, and data privacy. While it would undeniably provide significant additional value, its application in practice is often hindered by low throughput if run on-platform, and by security concerns and inconsistent latency if run off-platform. The IBM Telum chip introduces an on-chip AI accelerator that provides consistent low-latency, high-throughput (over 200 TFLOPS in a 32-chip system) inference capacity usable by all threads. The accelerator is memory coherent and directly connected to the fabric, like any other general-purpose core, to support low-latency inference while meeting the system's transaction rate. A scalable architecture providing transparent access to AI accelerator functions via a non-privileged general-purpose core instruction further reduces software orchestration and library complexity and makes the AI functions extensible. On a global bank's credit card fraud detection model, the AI accelerator achieves a 22× latency improvement compared to a general-purpose core using its vector execution units. For the same model, the AI accelerator sustains 116k inferences per second with a latency of only 1.1 msec. As the system is scaled up from one chip to 32 chips, it performs more than 3.5 million inferences/sec while latency stays very low at only 1.2 msec. This paper briefly introduces the IBM Telum chip and then describes the integrated AI accelerator. IBM Telum's AI accelerator architecture, microarchitecture, integration into the system stack, performance, and power are covered in detail.
{"title":"AI accelerator on IBM Telum processor: industrial product","authors":"C. Lichtenau, A. Buyuktosunoglu, Ramon Bertran Monfort, P. Figuli, C. Jacobi, N. Papandreou, H. Pozidis, A. Saporito, Andrew Sica, Elpida Tzortzatos","doi":"10.1145/3470496.3533042","DOIUrl":"https://doi.org/10.1145/3470496.3533042","url":null,"abstract":"IBM Telum is the next generation processor chip for IBM Z and LinuxONE systems. The Telum design is focused on enterprise class workloads and it achieves over 40% per socket performance growth compared to IBM z15. The IBM Telum is the first server-class chip with a dedicated on-chip AI accelerator that enables clients to gain real time insights from their data as it is getting processed. Seamlessly infusing AI in all enterprise workloads is highly desirable to get real business insight on every transaction as well as to improve IT operation, security, and data privacy. While it would undeniably provide significant additional value, its application in practice is often accompanied by hurdles from low throughput if run on-platform to security concerns and inconsistent latency if run off-platform. The IBM Telum chip introduces an on-chip AI accelerator that provides consistent low latency and high throughput (over 200 TFLOPS in 32 chip system) inference capacity usable by all threads. The accelerator is memory coherent and directly connected to the fabric like any other general-purpose core to support low latency inference while meeting the system's transaction rate. A scalable architecture providing transparent access to AI accelerator functions via a non-privileged general-purpose core instruction further reduces software orchestration and library complexity as well as provides extensibility to the AI functions. On a global bank customer credit card fraud detection model, the AI accelerator achieves 22× speed up in latency compared to a general purpose core utilizing vector execution units. For the same model, the AI accelerator achieves 116k inferences every second with a latency of only 1.1 msec. As the system is scaled up from one chip to 32 chips, it performs more than 3.5 Million inferences/sec and the latency still stays very low at only 1.2 msec. This paper briefly introduces the IBM Telum chip and later describes the integrated AI accelerator. IBM Telum's AI accelerator architecture, microarchitecture, integration into the system stack, performance, and power are covered in detail.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"213 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122111348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the presence of permafrost, the negative heat flux coming up from the cold frozen subsurface leads to a strongly negative winter equilibrium temperature (WEqT), typically below -2 °C, whereas on non-frozen soil the WEqT is close to 0 °C or only moderately negative (Haeberli 1973). The WEqT can therefore be a good indicator of permafrost occurrence and can help discriminate permafrost from non-permafrost areas, provided that the snow cover developed early in the winter and remained thick enough to isolate the soil surface from atmospheric influence.
{"title":"BTS","authors":"","doi":"10.1145/3470496.3527415","DOIUrl":"https://doi.org/10.1145/3470496.3527415","url":null,"abstract":"In presence of permafrost, the negative heat flux coming up from the cold frozen subsurface will lead to strongly negative WEqT (typically less than -2 °C), whereas on non frozen soil, the WEqT will be close to 0 °C or moderately negative (Haeberli 1973). Thus the WEqT can be a good indicator of permafrost occurrence and can help to discriminate permafrost from non-permafrost areas, provided that the snow cover developped early in the winter and remained sufficient to isolate the soil surface from atmospheric influence.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122178818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sudhanshu Shukla, Sumeet Bandishte, Jayesh Gaur, S. Subramoney
The memory wall continues to limit the performance of modern out-of-order (OOO) processors, despite the expensive provisioning of large multi-level caches and advancements in memory prefetching. In this paper, we put forth an important observation: the memory wall is not monolithic, but is composed of many latency walls, one arising from each tier of the cache/memory hierarchy. Our results show that even though level-1 (L1) data cache latency is nearly 40× lower than main memory latency, mitigating it offers a performance opportunity very similar to that of the more widely studied main memory latency. This motivates our proposal, Register File Prefetch (RFP), which intelligently utilizes the existing OOO scheduling pipeline and the available L1 data cache/register file bandwidth to successfully prefetch 43.4% of load requests from the L1 cache into the register file. Simulation results on 65 diverse workloads show that this translates to a 3.1% performance gain over a baseline with parameters similar to the Intel Tiger Lake processor, which further increases to 5.7% for a futuristic up-scaled core. We also contrast register file prefetching with techniques like load value and address prediction, which enhance performance by speculatively breaking data dependencies. Our analysis shows that RFP is synergistic with value prediction, with the two features together delivering a 4.1% average performance improvement, significantly higher than the 2.2% gain obtained from value prediction alone.
{"title":"Register file prefetching","authors":"Sudhanshu Shukla, Sumeet Bandishte, Jayesh Gaur, S. Subramoney","doi":"10.1145/3470496.3527398","DOIUrl":"https://doi.org/10.1145/3470496.3527398","url":null,"abstract":"The memory wall continues to limit the performance of modern out-of-order (OOO) processors, despite the expensive provisioning of large multi-level caches and advancements in memory prefetching. In this paper, we put forth an important observation that the memory wall is not monolithic, but is constituted of many latency walls arising due to the latency of each tier of cache/memory. Our results show that even though level-1 (L1) data cache latency is nearly 40X lower than main memory latency, mitigating this latency offers a very similar performance opportunity as the more widely studied, main memory latency. This motivates our proposal Register File Prefetch (RFP) that intelligently utilizes the existing OOO scheduling pipeline and available L1 data cache/Register File bandwidth to successfully prefetch 43.4% of load requests from the L1 cache to the Register File. Simulation results on 65 diverse workloads show that this translates to 3.1% performance gain over a baseline with parameters similar to Intel Tiger Lake processor, which further increases to 5.7% for a futuristic up-scaled core. We also contrast and differentiate register file prefetching from techniques like load value and address prediction that enhance performance by speculatively breaking data dependencies. Our analysis shows that RFP is synergistic with value prediction, with both the features together delivering 4.1% average performance improvement, which is significantly higher than the 2.2% performance gain obtained from just doing value prediction.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130374551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}